├── LICENSE
├── README.md
├── analysis.py
├── app.py
├── asset
│   ├── Yingram.png
│   └── overall.png
├── attentions.py
├── commons.py
├── configs
│   ├── config_en.yaml
│   ├── config_ja_22050.yaml
│   └── config_ja_44100.yaml
├── data_utils.py
├── dataset
│   └── preprocess.py
├── filelists
│   ├── vctk_test_g2p.txt
│   ├── vctk_train_g2p.txt
│   └── vctk_val_g2p.txt
├── inference.py
├── losses.py
├── mel_processing.py
├── metadata_cleaners.py
├── models.py
├── modules.py
├── monotonic_align
│   ├── __init__.py
│   ├── core.pyx
│   └── setup.py
├── pqmf.py
├── requirements.txt
├── text
│   ├── __init__.py
│   ├── cleaners.py
│   ├── numbers.py
│   └── symbols.py
├── train.py
├── transforms.py
├── utils.py
└── yin.py
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2023 ㌧㌧
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # PITS (44100 Hz Japanese Version)
2 | **PITS: Variational Pitch Inference without Fundamental Frequency for End-to-End Pitch-controllable TTS**
3 | 
4 | This repository is [PITS](https://github.com/anonymous-pits/pits) modified so that it can train on and synthesize 44100 Hz Japanese speech. In its initial state it is the PITS (A+D) variant without vector quantization; changing the few lines of code marked "for Q option" in models.py switches it to the PITS (A+D+Q) variant.
5 | 
6 | ![overall](asset/overall.png)
7 | 
8 | ## 1. Environment Setup
9 | An Anaconda-based environment is assumed.
10 | 
11 | 1. Create a virtual environment named "PITS" with Anaconda. Enter [y] when asked [y] or n.
12 | ```sh
13 | conda create -n PITS python=3.8
14 | ```
15 | 1. Activate the virtual environment.
16 | ```sh
17 | conda activate PITS
18 | ```
19 | 1. Clone this repository (or download it via Download ZIP).
20 | ```sh
21 | git clone https://github.com/tonnetonne814/PITS-44100-Ja.git
22 | cd PITS-44100-Ja # move into the folder
23 | ```
24 | 1. Install PyTorch for your environment from [PyTorch.org](https://pytorch.org/).
25 | ```sh
26 | # example for OS=Linux, CUDA=11.7
27 | pip3 install torch torchvision torchaudio
28 | ```
29 | 1. Install the remaining required packages.
30 | ```sh
31 | pip install -r requirements.txt
32 | ```
33 | 1. Build Monotonic Alignment Search.
34 | ```sh
35 | cd monotonic_align
36 | mkdir monotonic_align
37 | python setup.py build_ext --inplace
38 | ```
39 | ## 2. Dataset Preparation
40 | Training on parallel100 (100 read utterances shared across speakers) and nonpara30 (30 read utterances unique to each speaker) from the [JVS corpus](https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus) is assumed.
41 | 
42 | 1. Download and extract the JVS corpus from [here](https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus).
43 | 1. Convert the sampling rate of the utterance audio files to 44100 Hz. Adjust the path/to/... parts as appropriate.
44 | ```sh
45 | python3 ./dataset/preprocess.py --folder_path path/to/jvs_ver1/ --sampling_rate 44100
46 | ```
47 | 
48 | > ⚠ For path/to/jvs_ver1/, specify the folder that contains the per-speaker utterance folders of the JVS corpus [jvs001, jvs002, ... , jvs100].
49 | 
50 | ## 3. Edit the YAML in the [configs](configs) folder
51 | The main parameters are listed in the table below.
52 | 
53 | | Section | Parameter | Description |
54 | |:-----:|:-----------------:|:------------------------------------------------------------------------------------------:|
55 | | train | log_interval | Compute and log the losses every specified number of steps |
56 | | train | eval_interval | Evaluate the model every specified number of steps |
57 | | train | save_interval | Save the model every specified number of epochs |
58 | | train | epochs | Number of passes over the entire training data |
59 | | train | batch_size | Number of training samples used for one parameter update |
60 | | data | data_path | Folder containing the JVS speaker folders (the path/to/jvs_ver1/ value used with preprocess.py) |
61 | | data | training_files | Path to the training filelist |
62 | | data | validation_files | Path to the validation filelist |
63 | | data | speakers | List of speaker names |
64 | 
65 | Set the data_path value in the data section of config_ja_44100.yaml to the JVS folder path used with preprocess.py in "2. Dataset Preparation".
66 | 
67 | ## 4. Training
68 | Training the 44100 Hz PITS (A+D) model is assumed. Enter the following in a terminal to start training. Adjust the path/to/... parts as appropriate.
69 | ```sh
70 | python3 train.py --config ./configs/config_ja_44100.yaml --model PITS_A+D
71 | # to resume training from a checkpoint, add --resume path/to/checkpoint.pt
72 | ```
73 | For nonpara30, utterances whose transcripts (transcripts_utf8.txt) do not match the wav files actually present are excluded at this point.
74 | 
75 | Training progress is printed to the terminal, but with TensorBoard you can also listen to generated audio and visually inspect spectrograms, Yingrams, and the loss curves.
76 | ```sh
77 | tensorboard --logdir ./logs/PITS_A+D/
78 | ```
79 | 
80 | ## 5. Inference
81 | To run inference, enter the following in a terminal. Adjust the path/to/... parts as appropriate.
82 | ```sh
83 | python3 inference.py --config path/to/config.yaml --model PITS_A+D --model_path path/to/checkpoint.pth
84 | ```
85 | In the terminal, enter a speaker name, the text to be read, and a pitch shift (an integer), and the audio is generated. The audio is played automatically and saved to the infer_logs folder (created automatically if it does not exist).
86 | 
87 | ## 6. Fine-tuning
88 | 1. Create the filelists and related files for fine-tuning.
89 | Use the contents of ./filelists/*.txt as a reference. The format is as follows.
90 | ```sh
91 | path from the speaker folder to the wav file | utterance text | speaker name
92 | ```
93 | 2. Set training_files and validation_files in config.yaml to the lists you created.
94 | 3. Add or change the speaker names in the speakers section of config.yaml.
95 | 4. Enter the following in a terminal to run fine-tuning. Adjust the path/to/... parts as appropriate.
96 | ```sh
97 | python3 train.py --config path/to/config.yaml --model PITS_A+D_finetune --force_resume path/to/checkpoint.pt
98 | ```
99 | 
100 | ## Pretrained Models
101 | These models were trained on the JVS corpus for roughly 150 epochs (22050 Hz version) and 250 epochs (44100 Hz version). They should be usable for speaking and pitch shifting, but note that they are **undertrained**.
102 | 
103 | **Download**:
104 | [PITS (A+D) 22050 Hz version](https://drive.google.com/file/d/18eOyh8yEqryYTTssA1yUetbs66SbtlrM/view?usp=share_link) [PITS (A+D) 44100 Hz version](https://drive.google.com/file/d/1AfSUHXkZ20_i_zwN8-f6I122pJd3Zb32/view?usp=share_link)
105 | 
106 | ## Appendix (Yingram Visualization)
107 | >Yingram, an acoustic feature inspired by [YIN algorithm [22]](http://audition.ens.fr/adc/pdf/2002_JASA_YIN.pdf) that captures pitch information including harmonics. Yingram is designed to address the limitations of extracting f0, which is not well-defined in some cases [[23]](https://arxiv.org/abs/1910.10235), and the Yingram-based model shows better preference than the f0-based model [[16]](https://arxiv.org/abs/2110.14513). 
108 | >> DeepL : Yingramは、[YINアルゴリズム[22]](http://audition.ens.fr/adc/pdf/2002_JASA_YIN.pdf)にインスパイアされた音響特徴で、倍音を含むピッチ情報を捉えます。Yingramは、場合によってはうまく定義できないf0を抽出する限界に対処するために設計され[[23]](https://arxiv.org/abs/1910.10235)、Yingramベースのモデルはf0ベースのモデルよりも優れた選好性を示します[[16]](https://arxiv.org/abs/2110.14513)。 109 | 110 | ![overall](asset/Yingram.png) 111 | 112 | ## 参考文献 113 | - Official PITS Implementation; https://github.com/anonymous-pits/pits 114 | - Official VITS Implementation: https://github.com/jaywalnut310/vits 115 | - NANSY Implementation from dhchoi99: https://github.com/dhchoi99/NANSY 116 | - Official Avocodo Implementation: https://github.com/ncsoft/avocodo 117 | - Official PhaseAug Implementation: https://github.com/mindslab-ai/phaseaug 118 | - Tacotron Implementation from keithito: https://github.com/keithito/tacotron 119 | - CSTR VCTK Corpus (version 0.92): https://datashare.ed.ac.uk/handle/10283/3443 120 | - G2P for demo, g2p\_en from Kyubyong: https://github.com/Kyubyong/g2p 121 | - ESPNet:end-to-end speech processing toolkit: https://github.com/espnet/espnet -------------------------------------------------------------------------------- /analysis.py: -------------------------------------------------------------------------------- 1 | # modified from https://github.com/dhchoi99/NANSY 2 | # We have modified the implementation of dhchoi99 to be fully differentiable. 3 | import math 4 | import torch 5 | from yin import * 6 | 7 | 8 | class Pitch(torch.nn.Module): 9 | 10 | def __init__( 11 | self, 12 | sr=22050, 13 | w_step=256, 14 | W=2048, 15 | tau_max=2048, 16 | midi_start=5, 17 | midi_end=85, 18 | octave_range=12): 19 | super(Pitch, self).__init__() 20 | self.sr = sr 21 | self.w_step = w_step 22 | self.W = W 23 | self.tau_max = tau_max 24 | self.unfold = torch.nn.Unfold((1, self.W), 25 | 1, 26 | 0, 27 | stride=(1, self.w_step)) 28 | midis = list(range(midi_start, midi_end)) 29 | self.len_midis = len(midis) 30 | c_ms = torch.tensor([self.midi_to_lag(m, octave_range) for m in midis]) 31 | self.register_buffer('c_ms', c_ms) 32 | self.register_buffer('c_ms_ceil', torch.ceil(self.c_ms).long()) 33 | self.register_buffer('c_ms_floor', torch.floor(self.c_ms).long()) 34 | 35 | def midi_to_lag(self, m: int, octave_range: float = 12): 36 | """converts midi-to-lag, eq. (4) 37 | 38 | Args: 39 | m: midi 40 | sr: sample_rate 41 | octave_range: 42 | 43 | Returns: 44 | lag: time lag(tau, c(m)) calculated from midi, eq. (4) 45 | 46 | """ 47 | f = 440 * math.pow(2, (m - 69) / octave_range) 48 | lag = self.sr / f 49 | return lag 50 | 51 | def yingram_from_cmndf(self, cmndfs: torch.Tensor) -> torch.Tensor: 52 | """ yingram calculator from cMNDFs(cumulative Mean Normalized Difference Functions) 53 | 54 | Args: 55 | cmndfs: torch.Tensor 56 | calculated cumulative mean normalized difference function 57 | for details, see models/yin.py or eq. 
(1) and (2) 58 | ms: list of midi(int) 59 | sr: sampling rate 60 | 61 | Returns: 62 | y: 63 | calculated batch yingram 64 | 65 | 66 | """ 67 | #c_ms = np.asarray([Pitch.midi_to_lag(m, sr) for m in ms]) 68 | #c_ms = torch.from_numpy(c_ms).to(cmndfs.device) 69 | 70 | y = (cmndfs[:, self.c_ms_ceil] - 71 | cmndfs[:, self.c_ms_floor]) / (self.c_ms_ceil - self.c_ms_floor).unsqueeze(0) * ( 72 | self.c_ms - self.c_ms_floor).unsqueeze(0) + cmndfs[:, self.c_ms_floor] 73 | return y 74 | 75 | def yingram(self, x: torch.Tensor): 76 | """calculates yingram from raw audio (multi segment) 77 | 78 | Args: 79 | x: raw audio, torch.Tensor of shape (t) 80 | W: yingram Window Size 81 | tau_max: 82 | sr: sampling rate 83 | w_step: yingram bin step size 84 | 85 | Returns: 86 | yingram: yingram. torch.Tensor of shape (80 x t') 87 | 88 | """ 89 | # x.shape: t -> B,T, B,T = x.shape 90 | B, T = x.shape 91 | w_len = self.W 92 | 93 | 94 | frames = self.unfold(x.view(B, 1, 1, T)) 95 | frames = frames.permute(0, 2, 96 | 1).contiguous().view(-1, 97 | self.W) #[B* frames, W] 98 | # If not using gpu, or torch not compatible, implemented numpy batch function is still fine 99 | dfs = differenceFunctionTorch(frames, frames.shape[-1], self.tau_max) 100 | cmndfs = cumulativeMeanNormalizedDifferenceFunctionTorch( 101 | dfs, self.tau_max) 102 | yingram = self.yingram_from_cmndf(cmndfs) #[B*frames,F] 103 | yingram = yingram.view(B, -1, self.len_midis).permute(0, 2, 104 | 1) # [B,F,T] 105 | return yingram 106 | 107 | def crop_scope(self, x, yin_start, 108 | scope_shift): # x: tensor [B,C,T] #scope_shift: tensor [B] 109 | return torch.stack([ 110 | x[i, yin_start + scope_shift[i]:yin_start + self.yin_scope + 111 | scope_shift[i], :] for i in range(x.shape[0]) 112 | ], 113 | dim=0) 114 | 115 | 116 | if __name__ == '__main__': 117 | import torch 118 | import librosa as rosa 119 | import matplotlib.pyplot as plt 120 | wav = torch.tensor(rosa.load('LJ001-0002.wav', sr=22050, 121 | mono=True)[0]).unsqueeze(0) 122 | # wav = torch.randn(1,40965) 123 | 124 | wav = torch.nn.functional.pad(wav, (0, (-wav.shape[1]) % 256)) 125 | # wav = wav[#:,:8096] 126 | print(wav.shape) 127 | pitch = Pitch() 128 | 129 | with torch.no_grad(): 130 | ps = pitch.yingram(torch.nn.functional.pad(wav, (1024, 1024))) 131 | ps = torch.nn.functional.pad(ps, (0, 0, 8, 8), mode='replicate') 132 | print(ps.shape) 133 | spec = torch.stft(wav, 1024, 256, return_complex=False) 134 | print(spec.shape) 135 | plt.subplot(2, 1, 1) 136 | plt.pcolor(ps[0].numpy(), cmap='magma') 137 | plt.colorbar() 138 | plt.subplot(2, 1, 2) 139 | plt.pcolor(ps[0][15:65, :].numpy(), cmap='magma') 140 | plt.colorbar() 141 | plt.show() 142 | -------------------------------------------------------------------------------- /app.py: -------------------------------------------------------------------------------- 1 | import gradio as gr 2 | import argparse 3 | import torch 4 | import commons 5 | import utils 6 | from models import ( 7 | SynthesizerTrn, ) 8 | 9 | from text.symbols import symbol_len, lang_to_dict 10 | 11 | # we use Kyubyong/g2p for demo instead of our internal g2p 12 | # https://github.com/Kyubyong/g2p 13 | from g2p_en import G2p 14 | import re 15 | 16 | _symbol_to_id = lang_to_dict("en_US") 17 | 18 | class GradioApp: 19 | 20 | def __init__(self, args): 21 | self.hps = utils.get_hparams_from_file(args.config) 22 | self.device = "cpu" 23 | self.net_g = SynthesizerTrn(symbol_len(self.hps.data.languages), 24 | self.hps.data.filter_length // 2 + 1, 25 | self.hps.train.segment_size // 26 | 
self.hps.data.hop_length, 27 | midi_start=-5, 28 | midi_end=75, 29 | octave_range=24, 30 | n_speakers=len(self.hps.data.speakers), 31 | **self.hps.model).to(self.device) 32 | _ = self.net_g.eval() 33 | _ = utils.load_checkpoint(args.checkpoint_path, model_g=self.net_g) 34 | self.g2p = G2p() 35 | self.interface = self._gradio_interface() 36 | 37 | def get_phoneme(self, text): 38 | phones = [re.sub("[0-9]", "", p) for p in self.g2p(text)] 39 | tone = [0 for p in phones] 40 | if self.hps.data.add_blank: 41 | text_norm = [_symbol_to_id[symbol] for symbol in phones] 42 | text_norm = commons.intersperse(text_norm, 0) 43 | tone = commons.intersperse(tone, 0) 44 | else: 45 | text_norm = phones 46 | text_norm = torch.LongTensor(text_norm) 47 | tone = torch.LongTensor(tone) 48 | return text_norm, tone, phones 49 | 50 | def inference(self, text, speaker_id_val, seed, scope_shift, duration): 51 | seed = int(seed) 52 | scope_shift = int(scope_shift) 53 | torch.manual_seed(seed) 54 | text_norm, tone, phones = self.get_phoneme(text) 55 | x_tst = text_norm.to(self.device).unsqueeze(0) 56 | t_tst = tone.to(self.device).unsqueeze(0) 57 | x_tst_lengths = torch.LongTensor([text_norm.size(0)]).to(self.device) 58 | speaker_id = torch.LongTensor([speaker_id_val]).to(self.device) 59 | decoder_inputs,*_ = self.net_g.infer_pre_decoder( 60 | x_tst, 61 | t_tst, 62 | x_tst_lengths, 63 | sid=speaker_id, 64 | noise_scale=0.667, 65 | noise_scale_w=0.8, 66 | length_scale=duration, 67 | scope_shift=scope_shift) 68 | audio = self.net_g.infer_decode_chunk( 69 | decoder_inputs, sid=speaker_id)[0, 0].data.cpu().float().numpy() 70 | del decoder_inputs, 71 | return phones, (self.hps.data.sampling_rate, audio) 72 | 73 | 74 | def _gradio_interface(self): 75 | title = "PITS Demo" 76 | self.inputs = [ 77 | gr.Textbox(label="Text (150 words limitation)", 78 | value="This is demo page.", 79 | elem_id="tts-input"), 80 | gr.Dropdown(list(self.hps.data.speakers), 81 | value="p225", 82 | label="Speaker Identity", 83 | type="index"), 84 | gr.Slider(0, 65536, value=0, step=1, label="random seed"), 85 | gr.Slider(-15, 15, value=0, step=1, label="scope-shift"), 86 | gr.Slider(0.5, 2., value=1., step=0.1, 87 | label="duration multiplier"), 88 | ] 89 | self.outputs = [ 90 | gr.Textbox(label="Phonemes"), 91 | gr.Audio(type="numpy", label="Output audio") 92 | ] 93 | description = "Welcome to the Gradio demo for PITS: Variational Pitch Inference without Fundamental Frequency for End-to-End Pitch-controllable TTS.\n In this demo, we utilize an open-source G2P library (g2p_en) with stress removing, instead of our internal G2P.\n You can fix the latent z by controlling random seed.\n You can shift the pitch scope, but please note that this is opposite to pitch-shift. In addition, it is cropped from fixed z so please check pitch-controllability by comparing with normal synthesis.\n Thank you for trying out our PITS demo!" 94 | article = "Github:https://github.com/anonymous-pits/pits \n Our current preprint contains several errors. Please wait for next update." 
95 | examples = [["This is a demo page of the PITS."],["I love hugging face."]] 96 | return gr.Interface( 97 | fn=self.inference, 98 | inputs=self.inputs, 99 | outputs=self.outputs, 100 | title=title, 101 | description=description, 102 | article=article, 103 | cache_examples=False, 104 | examples=examples, 105 | ) 106 | 107 | def launch(self): 108 | return self.interface.launch(share=False) 109 | 110 | 111 | def parsearg(): 112 | parser = argparse.ArgumentParser() 113 | parser.add_argument('-c', 114 | '--config', 115 | type=str, 116 | default="./configs/config_en.yaml", 117 | help='Path to configuration file') 118 | parser.add_argument('-m', 119 | '--model', 120 | type=str, 121 | default='PITS', 122 | help='Model name') 123 | parser.add_argument('-r', 124 | '--checkpoint_path', 125 | type=str, 126 | default='./logs/pits_vctk_AD_3000.pth', 127 | help='Path to checkpoint for resume') 128 | parser.add_argument('-f', 129 | '--force_resume', 130 | type=str, 131 | help='Path to checkpoint for force resume') 132 | parser.add_argument('-d', 133 | '--dir', 134 | type=str, 135 | default='/DATA/audio/pits_samples', 136 | help='root dir') 137 | args = parser.parse_args() 138 | return args 139 | 140 | if __name__ == "__main__": 141 | args = parsearg() 142 | app = GradioApp(args) 143 | app.launch() 144 | -------------------------------------------------------------------------------- /asset/Yingram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tonnetonne814/PITS-44100-Ja/8283da256f0fd43394ccb6f45497f59021e29e41/asset/Yingram.png -------------------------------------------------------------------------------- /asset/overall.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tonnetonne814/PITS-44100-Ja/8283da256f0fd43394ccb6f45497f59021e29e41/asset/overall.png -------------------------------------------------------------------------------- /attentions.py: -------------------------------------------------------------------------------- 1 | # from https://github.com/jaywalnut310/vits 2 | import math 3 | import torch 4 | from torch import nn 5 | from torch.nn import functional as F 6 | 7 | import commons 8 | from modules import LayerNorm 9 | 10 | 11 | class Encoder(nn.Module): 12 | def __init__( 13 | self, 14 | hidden_channels, 15 | filter_channels, 16 | n_heads, 17 | n_layers, 18 | kernel_size=1, 19 | p_dropout=0., 20 | window_size=4, 21 | **kwargs 22 | ): 23 | super().__init__() 24 | self.hidden_channels = hidden_channels 25 | self.filter_channels = filter_channels 26 | self.n_heads = n_heads 27 | self.n_layers = n_layers 28 | self.kernel_size = kernel_size 29 | self.p_dropout = p_dropout 30 | self.window_size = window_size 31 | 32 | self.drop = nn.Dropout(p_dropout) 33 | self.attn_layers = nn.ModuleList() 34 | self.norm_layers_1 = nn.ModuleList() 35 | self.ffn_layers = nn.ModuleList() 36 | self.norm_layers_2 = nn.ModuleList() 37 | for i in range(self.n_layers): 38 | self.attn_layers.append( 39 | MultiHeadAttention( 40 | hidden_channels, 41 | hidden_channels, 42 | n_heads, 43 | p_dropout=p_dropout, 44 | window_size=window_size 45 | ) 46 | ) 47 | self.norm_layers_1.append(LayerNorm(hidden_channels)) 48 | self.ffn_layers.append( 49 | FFN( 50 | hidden_channels, 51 | hidden_channels, 52 | filter_channels, 53 | kernel_size, 54 | p_dropout=p_dropout 55 | ) 56 | ) 57 | self.norm_layers_2.append(LayerNorm(hidden_channels)) 58 | 59 | def forward(self, x, x_mask): 60 | 
attn_mask = x_mask.unsqueeze(2) * x_mask.unsqueeze(-1) 61 | x = x * x_mask 62 | for i in range(self.n_layers): 63 | y = self.attn_layers[i](x, x, attn_mask) 64 | y = self.drop(y) 65 | x = self.norm_layers_1[i](x + y) 66 | 67 | y = self.ffn_layers[i](x, x_mask) 68 | y = self.drop(y) 69 | x = self.norm_layers_2[i](x + y) 70 | x = x * x_mask 71 | return x 72 | 73 | 74 | class Decoder(nn.Module): 75 | def __init__( 76 | self, 77 | hidden_channels, 78 | filter_channels, 79 | n_heads, 80 | n_layers, 81 | kernel_size=1, 82 | p_dropout=0., 83 | proximal_bias=False, 84 | proximal_init=True, 85 | **kwargs 86 | ): 87 | super().__init__() 88 | self.hidden_channels = hidden_channels 89 | self.filter_channels = filter_channels 90 | self.n_heads = n_heads 91 | self.n_layers = n_layers 92 | self.kernel_size = kernel_size 93 | self.p_dropout = p_dropout 94 | self.proximal_bias = proximal_bias 95 | self.proximal_init = proximal_init 96 | 97 | self.drop = nn.Dropout(p_dropout) 98 | self.self_attn_layers = nn.ModuleList() 99 | self.norm_layers_0 = nn.ModuleList() 100 | self.encdec_attn_layers = nn.ModuleList() 101 | self.norm_layers_1 = nn.ModuleList() 102 | self.ffn_layers = nn.ModuleList() 103 | self.norm_layers_2 = nn.ModuleList() 104 | for i in range(self.n_layers): 105 | self.self_attn_layers.append( 106 | MultiHeadAttention( 107 | hidden_channels, 108 | hidden_channels, 109 | n_heads, 110 | p_dropout=p_dropout, 111 | proximal_bias=proximal_bias, 112 | proximal_init=proximal_init 113 | ) 114 | ) 115 | self.norm_layers_0.append(LayerNorm(hidden_channels)) 116 | self.encdec_attn_layers.append( 117 | MultiHeadAttention( 118 | hidden_channels, 119 | hidden_channels, 120 | n_heads, 121 | p_dropout=p_dropout 122 | ) 123 | ) 124 | self.norm_layers_1.append(LayerNorm(hidden_channels)) 125 | self.ffn_layers.append( 126 | FFN( 127 | hidden_channels, 128 | hidden_channels, 129 | filter_channels, 130 | kernel_size, 131 | p_dropout=p_dropout, 132 | causal=True 133 | ) 134 | ) 135 | self.norm_layers_2.append(LayerNorm(hidden_channels)) 136 | 137 | def forward(self, x, x_mask, h, h_mask): 138 | """ 139 | x: decoder input 140 | h: encoder output 141 | """ 142 | self_attn_mask = commons.subsequent_mask( 143 | x_mask.size(2) 144 | ).to(device=x.device, dtype=x.dtype) 145 | encdec_attn_mask = h_mask.unsqueeze(2) * x_mask.unsqueeze(-1) 146 | x = x * x_mask 147 | for i in range(self.n_layers): 148 | y = self.self_attn_layers[i](x, x, self_attn_mask) 149 | y = self.drop(y) 150 | x = self.norm_layers_0[i](x + y) 151 | 152 | y = self.encdec_attn_layers[i](x, h, encdec_attn_mask) 153 | y = self.drop(y) 154 | x = self.norm_layers_1[i](x + y) 155 | 156 | y = self.ffn_layers[i](x, x_mask) 157 | y = self.drop(y) 158 | x = self.norm_layers_2[i](x + y) 159 | x = x * x_mask 160 | return x 161 | 162 | 163 | class MultiHeadAttention(nn.Module): 164 | def __init__( 165 | self, 166 | channels, 167 | out_channels, 168 | n_heads, 169 | p_dropout=0., 170 | window_size=None, 171 | heads_share=True, 172 | block_length=None, 173 | proximal_bias=False, 174 | proximal_init=False 175 | ): 176 | super().__init__() 177 | assert channels % n_heads == 0 178 | 179 | self.channels = channels 180 | self.out_channels = out_channels 181 | self.n_heads = n_heads 182 | self.p_dropout = p_dropout 183 | self.window_size = window_size 184 | self.heads_share = heads_share 185 | self.block_length = block_length 186 | self.proximal_bias = proximal_bias 187 | self.proximal_init = proximal_init 188 | self.attn = None 189 | 190 | self.k_channels = channels // n_heads 
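        # k_channels is the per-head dimension (channels is asserted above to be divisible by n_heads);
        # the 1x1 convolutions below produce the query/key/value projections and the output projection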
191 | self.conv_q = nn.Conv1d(channels, channels, 1) 192 | self.conv_k = nn.Conv1d(channels, channels, 1) 193 | self.conv_v = nn.Conv1d(channels, channels, 1) 194 | self.conv_o = nn.Conv1d(channels, out_channels, 1) 195 | self.drop = nn.Dropout(p_dropout) 196 | 197 | if window_size is not None: 198 | n_heads_rel = 1 if heads_share else n_heads 199 | rel_stddev = self.k_channels**-0.5 200 | self.emb_rel_k = nn.Parameter(torch.randn( 201 | n_heads_rel, window_size * 2 + 1, self.k_channels) * rel_stddev) 202 | self.emb_rel_v = nn.Parameter(torch.randn( 203 | n_heads_rel, window_size * 2 + 1, self.k_channels) * rel_stddev) 204 | 205 | nn.init.xavier_uniform_(self.conv_q.weight) 206 | nn.init.xavier_uniform_(self.conv_k.weight) 207 | nn.init.xavier_uniform_(self.conv_v.weight) 208 | if proximal_init: 209 | with torch.no_grad(): 210 | self.conv_k.weight.copy_(self.conv_q.weight) 211 | self.conv_k.bias.copy_(self.conv_q.bias) 212 | 213 | def forward(self, x, c, attn_mask=None): 214 | q = self.conv_q(x) 215 | k = self.conv_k(c) 216 | v = self.conv_v(c) 217 | 218 | x, self.attn = self.attention(q, k, v, mask=attn_mask) 219 | 220 | x = self.conv_o(x) 221 | return x 222 | 223 | def attention(self, query, key, value, mask=None): 224 | # reshape [b, d, t] -> [b, n_h, t, d_k] 225 | b, d, t_s, t_t = (*key.size(), query.size(2)) 226 | #query = query.view( 227 | # b, 228 | # self.n_heads, 229 | # self.k_channels, 230 | # t_t 231 | #).transpose(2, 3) #[b,h,t_t,c], d=h*c 232 | #key = key.view( 233 | # b, 234 | # self.n_heads, 235 | # self.k_channels, 236 | # t_s 237 | #).transpose(2, 3) #[b,h,t_s,c] 238 | #value = value.view( 239 | # b, 240 | # self.n_heads, 241 | # self.k_channels, 242 | # t_s 243 | #).transpose(2, 3) #[b,h,t_s,c] 244 | #scores = torch.matmul( 245 | # query / math.sqrt(self.k_channels), key.transpose(-2, -1) 246 | #) #[b,h,t_t,t_s] 247 | query = query.view( 248 | b, 249 | self.n_heads, 250 | self.k_channels, 251 | t_t 252 | ) #[b,h,c,t_t] 253 | key = key.view( 254 | b, 255 | self.n_heads, 256 | self.k_channels, 257 | t_s 258 | ) #[b,h,c,t_s] 259 | value = value.view( 260 | b, 261 | self.n_heads, 262 | self.k_channels, 263 | t_s 264 | ) #[b,h,c,t_s] 265 | scores = torch.einsum('bhdt,bhds -> bhts', query / math.sqrt(self.k_channels), key) #[b,h,t_t,t_s] 266 | #if self.window_size is not None: 267 | # assert t_s == t_t, "Relative attention is only available for self-attention." 268 | # key_relative_embeddings = self._get_relative_embeddings( 269 | # self.emb_rel_k, t_s 270 | # ) 271 | # rel_logits = self._matmul_with_relative_keys( 272 | # query / math.sqrt(self.k_channels), key_relative_embeddings 273 | # ) #[b,h,t_t,d],[h or 1,e,d] ->[b,h,t_t,e] 274 | # scores_local = self._relative_position_to_absolute_position(rel_logits) 275 | # scores = scores + scores_local 276 | #if self.proximal_bias: 277 | # assert t_s == t_t, "Proximal bias is only available for self-attention." 278 | # scores = scores + \ 279 | # self._attention_bias_proximal(t_s).to(device=scores.device, dtype=scores.dtype) 280 | #if mask is not None: 281 | # scores = scores.masked_fill(mask == 0, -1e4) 282 | # if self.block_length is not None: 283 | # assert t_s == t_t, "Local attention is only available for self-attention." 
284 | # block_mask = torch.ones_like(scores).triu(-self.block_length).tril(self.block_length) 285 | # scores = scores.masked_fill(block_mask == 0, -1e4) 286 | #p_attn = F.softmax(scores, dim=-1) # [b, h, t_t, t_s] 287 | #p_attn = self.drop(p_attn) 288 | #output = torch.matmul(p_attn, value) # [b,h,t_t,t_s],[b,h,t_s,c] -> [b,h,t_t,c] 289 | #if self.window_size is not None: 290 | # relative_weights = self._absolute_position_to_relative_position(p_attn) #[b, h, t_t, 2*t_t-1] 291 | # value_relative_embeddings = self._get_relative_embeddings(self.emb_rel_v, t_s) #[h or 1, 2*t_t-1, c] 292 | # output = output + \ 293 | # self._matmul_with_relative_values( 294 | # relative_weights, value_relative_embeddings) # [b, h, t_t, 2*t_t-1],[h or 1, 2*t_t-1, c] -> [b, h, t_t, c] 295 | #output = output.transpose(2, 3).contiguous().view(b, d, t_t) # [b, n_h, t_t, c] -> [b,h,c,t_t] -> [b, d, t_t] 296 | if self.window_size is not None: 297 | assert t_s == t_t, "Relative attention is only available for self-attention." 298 | key_relative_embeddings = self._get_relative_embeddings( 299 | self.emb_rel_k, t_s 300 | ) 301 | rel_logits = torch.einsum('bhdt,hed->bhte', 302 | query / math.sqrt(self.k_channels), key_relative_embeddings 303 | ) #[b,h,c,t_t],[h or 1,e,c] ->[b,h,t_t,e] 304 | scores_local = self._relative_position_to_absolute_position(rel_logits) 305 | scores = scores + scores_local 306 | if self.proximal_bias: 307 | assert t_s == t_t, "Proximal bias is only available for self-attention." 308 | scores = scores + \ 309 | self._attention_bias_proximal(t_s).to(device=scores.device, dtype=scores.dtype) 310 | if mask is not None: 311 | scores = scores.masked_fill(mask == 0, -1e4) 312 | if self.block_length is not None: 313 | assert t_s == t_t, "Local attention is only available for self-attention." 314 | block_mask = torch.ones_like(scores).triu(-self.block_length).tril(self.block_length) 315 | scores = scores.masked_fill(block_mask == 0, -1e4) 316 | p_attn = F.softmax(scores, dim=-1) # [b, h, t_t, t_s] 317 | p_attn = self.drop(p_attn) 318 | output = torch.einsum('bhcs,bhts->bhct', value , p_attn) # [b,h,c,t_s],[b,h,t_t,t_s] -> [b,h,c,t_t] 319 | if self.window_size is not None: 320 | relative_weights = self._absolute_position_to_relative_position(p_attn) #[b, h, t_t, 2*t_t-1] 321 | value_relative_embeddings = self._get_relative_embeddings(self.emb_rel_v, t_s) #[h or 1, 2*t_t-1, c] 322 | output = output + \ 323 | torch.einsum('bhte,hec->bhct', 324 | relative_weights, value_relative_embeddings) # [b, h, t_t, 2*t_t-1],[h or 1, 2*t_t-1, c] -> [b, h, c, t_t] 325 | output = output.view(b, d, t_t) # [b, h, c, t_t] -> [b, d, t_t] 326 | return output, p_attn 327 | 328 | def _matmul_with_relative_values(self, x, y): 329 | """ 330 | x: [b, h, l, m] 331 | y: [h or 1, m, d] 332 | ret: [b, h, l, d] 333 | """ 334 | ret = torch.matmul(x, y.unsqueeze(0)) 335 | return ret 336 | 337 | def _matmul_with_relative_keys(self, x, y): 338 | """ 339 | x: [b, h, l, d] 340 | y: [h or 1, m, d] 341 | ret: [b, h, l, m] 342 | """ 343 | #ret = torch.matmul(x, y.unsqueeze(0).transpose(-2, -1)) 344 | ret = torch.einsum('bhld,hmd -> bhlm', x, y) 345 | return ret 346 | 347 | def _get_relative_embeddings(self, relative_embeddings, length): 348 | max_relative_position = 2 * self.window_size + 1 349 | # Pad first before slice to avoid using cond ops. 
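        # The embedding table holds 2*window_size+1 relative positions; when length > window_size+1 it is
        # padded on both sides up to 2*length-1 entries, otherwise the needed span is simply sliced out.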
350 | pad_length = max(length - (self.window_size + 1), 0) 351 | slice_start_position = max((self.window_size + 1) - length, 0) 352 | slice_end_position = slice_start_position + 2 * length - 1 353 | if pad_length > 0: 354 | padded_relative_embeddings = F.pad( 355 | relative_embeddings, 356 | commons.convert_pad_shape([[0, 0], [pad_length, pad_length], [0, 0]])) 357 | else: 358 | padded_relative_embeddings = relative_embeddings 359 | used_relative_embeddings = padded_relative_embeddings[ 360 | :, slice_start_position:slice_end_position 361 | ] 362 | return used_relative_embeddings 363 | 364 | def _relative_position_to_absolute_position(self, x): 365 | """ 366 | x: [b, h, l, 2*l-1] 367 | ret: [b, h, l, l] 368 | """ 369 | batch, heads, length, _ = x.size() 370 | # Concat columns of pad to shift from relative to absolute indexing. 371 | x = F.pad(x, commons.convert_pad_shape( 372 | [[0, 0], [0, 0], [0, 0], [0, 1]] 373 | )) 374 | 375 | # Concat extra elements so to add up to shape (len+1, 2*len-1). 376 | x_flat = x.view([batch, heads, length * 2 * length]) 377 | x_flat = F.pad(x_flat, commons.convert_pad_shape( 378 | [[0, 0], [0, 0], [0, length-1]] 379 | )) 380 | 381 | # Reshape and slice out the padded elements. 382 | x_final = x_flat.view([batch, heads, length+1, 2*length-1])[ 383 | :, :, :length, length-1: 384 | ] 385 | return x_final 386 | 387 | def _absolute_position_to_relative_position(self, x): 388 | """ 389 | x: [b, h, l, l] 390 | ret: [b, h, l, 2*l-1] 391 | """ 392 | batch, heads, length, _ = x.size() 393 | # padd along column 394 | x = F.pad(x, commons.convert_pad_shape( 395 | [[0, 0], [0, 0], [0, 0], [0, length-1]] 396 | )) 397 | x_flat = x.view([batch, heads, length**2 + length*(length - 1)]) 398 | # add 0's in the beginning that will skew the elements after reshape 399 | x_flat = F.pad(x_flat, commons.convert_pad_shape( 400 | [[0, 0], [0, 0], [length, 0]] 401 | )) 402 | x_final = x_flat.view([batch, heads, length, 2*length])[:, :, :, 1:] 403 | return x_final 404 | 405 | def _attention_bias_proximal(self, length): 406 | """Bias for self-attention to encourage attention to close positions. 407 | Args: 408 | length: an integer scalar. 
409 | Returns: 410 | a Tensor with shape [1, 1, length, length] 411 | """ 412 | r = torch.arange(length, dtype=torch.float32) 413 | diff = torch.unsqueeze(r, 0) - torch.unsqueeze(r, 1) 414 | return torch.unsqueeze(torch.unsqueeze(-torch.log1p(torch.abs(diff)), 0), 0) 415 | 416 | 417 | class FFN(nn.Module): 418 | def __init__( 419 | self, 420 | in_channels, 421 | out_channels, 422 | filter_channels, 423 | kernel_size, 424 | p_dropout=0., 425 | activation=None, 426 | causal=False 427 | ): 428 | super().__init__() 429 | self.in_channels = in_channels 430 | self.out_channels = out_channels 431 | self.filter_channels = filter_channels 432 | self.kernel_size = kernel_size 433 | self.p_dropout = p_dropout 434 | self.activation = activation 435 | self.causal = causal 436 | 437 | if causal: 438 | self.padding = self._causal_padding 439 | else: 440 | self.padding = self._same_padding 441 | 442 | self.conv_1 = nn.Conv1d(in_channels, filter_channels, kernel_size) 443 | self.conv_2 = nn.Conv1d(filter_channels, out_channels, kernel_size) 444 | self.drop = nn.Dropout(p_dropout) 445 | 446 | def forward(self, x, x_mask): 447 | x = self.conv_1(self.padding(x * x_mask)) 448 | if self.activation == "gelu": 449 | x = x * torch.sigmoid(1.702 * x) 450 | else: 451 | x = torch.relu(x) 452 | x = self.drop(x) 453 | x = self.conv_2(self.padding(x * x_mask)) 454 | return x * x_mask 455 | 456 | def _causal_padding(self, x): 457 | if self.kernel_size == 1: 458 | return x 459 | pad_l = self.kernel_size - 1 460 | pad_r = 0 461 | padding = [[0, 0], [0, 0], [pad_l, pad_r]] 462 | x = F.pad(x, commons.convert_pad_shape(padding)) 463 | return x 464 | 465 | def _same_padding(self, x): 466 | if self.kernel_size == 1: 467 | return x 468 | pad_l = (self.kernel_size - 1) // 2 469 | pad_r = self.kernel_size // 2 470 | padding = [[0, 0], [0, 0], [pad_l, pad_r]] 471 | x = F.pad(x, commons.convert_pad_shape(padding)) 472 | return x 473 | -------------------------------------------------------------------------------- /commons.py: -------------------------------------------------------------------------------- 1 | # from https://github.com/jaywalnut310/vits 2 | import math 3 | import torch 4 | from torch.nn import functional as F 5 | 6 | 7 | def init_weights(m, mean=0.0, std=0.01): 8 | classname = m.__class__.__name__ 9 | if classname.find("Conv") != -1: 10 | m.weight.data.normal_(mean, std) 11 | 12 | 13 | def get_padding(kernel_size, dilation=1): 14 | return int((kernel_size * dilation - dilation) / 2) 15 | 16 | 17 | def convert_pad_shape(pad_shape): 18 | l = pad_shape[::-1] 19 | pad_shape = [item for sublist in l for item in sublist] 20 | return pad_shape 21 | 22 | 23 | def intersperse(lst, item): 24 | result = [item] * (len(lst) * 2 + 1) 25 | result[1::2] = lst 26 | return result 27 | 28 | 29 | def kl_divergence(m_p, logs_p, m_q, logs_q): 30 | """KL(P||Q)""" 31 | kl = (logs_q - logs_p) - 0.5 32 | kl += 0.5 * (torch.exp(2. * logs_p) + ((m_p - m_q)**2)) * torch.exp(-2. 
* logs_q) 33 | return kl 34 | 35 | 36 | def rand_gumbel(shape): 37 | """Sample from the Gumbel distribution, protect from overflows.""" 38 | uniform_samples = torch.rand(shape) * 0.99998 + 0.00001 39 | return -torch.log(-torch.log(uniform_samples)) 40 | 41 | 42 | def rand_gumbel_like(x): 43 | g = rand_gumbel(x.size()).to(dtype=x.dtype, device=x.device) 44 | return g 45 | 46 | 47 | def slice_segments(x, ids_str, segment_size=4): 48 | ret = torch.zeros_like(x[:, :, :segment_size]) 49 | for i in range(x.size(0)): 50 | idx_str = ids_str[i] 51 | idx_end = idx_str + segment_size 52 | ret[i] = x[i, :, idx_str:idx_end] 53 | return ret 54 | 55 | 56 | def rand_slice_segments(x, x_lengths=None, segment_size=4): 57 | b, d, t = x.size() 58 | if x_lengths is None: 59 | x_lengths = t 60 | ids_str_max = x_lengths - segment_size + 1 61 | ids_str = (torch.rand([b]).to(device=x.device) 62 | * ids_str_max).to(dtype=torch.long) 63 | ids_str = torch.max(torch.zeros(ids_str.size()).to(ids_str.device), ids_str).to(dtype=torch.long) 64 | ret = slice_segments(x, ids_str, segment_size) 65 | return ret, ids_str 66 | 67 | def rand_slice_segments_for_cat(x, x_lengths=None, segment_size=4): 68 | b, d, t = x.size() 69 | if x_lengths is None: 70 | x_lengths = t 71 | ids_str_max = x_lengths - segment_size + 1 72 | ids_str = torch.rand([b//2]).to(device=x.device) 73 | ids_str = (torch.cat([ids_str,ids_str], dim=0) 74 | * ids_str_max).to(dtype=torch.long) 75 | ids_str = torch.max(torch.zeros(ids_str.size()).to(ids_str.device), ids_str).to(dtype=torch.long) 76 | ret = slice_segments(x, ids_str, segment_size) 77 | return ret, ids_str 78 | 79 | 80 | 81 | 82 | def get_timing_signal_1d( 83 | length, channels, min_timescale=1.0, max_timescale=1.0e4): 84 | position = torch.arange(length, dtype=torch.float) 85 | num_timescales = channels // 2 86 | log_timescale_increment = ( 87 | math.log(float(max_timescale) / float(min_timescale)) / (num_timescales - 1) 88 | ) 89 | inv_timescales = min_timescale * torch.exp( 90 | torch.arange(num_timescales, dtype=torch.float) * -log_timescale_increment 91 | ) 92 | scaled_time = position.unsqueeze(0) * inv_timescales.unsqueeze(1) 93 | signal = torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], 0) 94 | signal = F.pad(signal, [0, 0, 0, channels % 2]) 95 | signal = signal.view(1, channels, length) 96 | return signal 97 | 98 | 99 | def add_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4): 100 | b, channels, length = x.size() 101 | signal = get_timing_signal_1d( 102 | length, channels, min_timescale, max_timescale 103 | ) 104 | return x + signal.to(dtype=x.dtype, device=x.device) 105 | 106 | 107 | def cat_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4, axis=1): 108 | b, channels, length = x.size() 109 | signal = get_timing_signal_1d( 110 | length, channels, min_timescale, max_timescale 111 | ) 112 | return torch.cat([x, signal.to(dtype=x.dtype, device=x.device)], axis) 113 | 114 | 115 | def subsequent_mask(length): 116 | mask = torch.tril(torch.ones(length, length)).unsqueeze(0).unsqueeze(0) 117 | return mask 118 | 119 | 120 | @torch.jit.script 121 | def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels): 122 | n_channels_int = n_channels[0] 123 | in_act = input_a + input_b 124 | t_act = torch.tanh(in_act[:, :n_channels_int, :]) 125 | s_act = torch.sigmoid(in_act[:, n_channels_int:, :]) 126 | acts = t_act * s_act 127 | return acts 128 | 129 | 130 | def convert_pad_shape(pad_shape): 131 | l = pad_shape[::-1] 132 | pad_shape = [item for sublist in l for item in 
sublist] 133 | return pad_shape 134 | 135 | 136 | def shift_1d(x): 137 | x = F.pad(x, convert_pad_shape([[0, 0], [0, 0], [1, 0]]))[:, :, :-1] 138 | return x 139 | 140 | 141 | def sequence_mask(length, max_length=None): 142 | if max_length is None: 143 | max_length = length.max() 144 | x = torch.arange(max_length, dtype=length.dtype, device=length.device) 145 | return x.unsqueeze(0) < length.unsqueeze(1) 146 | 147 | 148 | def generate_path(duration, mask): 149 | """ 150 | duration: [b, 1, t_x] 151 | mask: [b, 1, t_y, t_x] 152 | """ 153 | device = duration.device 154 | 155 | b, _, t_y, t_x = mask.shape 156 | cum_duration = torch.cumsum(duration, -1) 157 | 158 | cum_duration_flat = cum_duration.view(b * t_x) 159 | path = sequence_mask(cum_duration_flat, t_y).to(mask.dtype) 160 | path = path.view(b, t_x, t_y) 161 | path = path - F.pad(path, convert_pad_shape([[0, 0], [1, 0], [0, 0]]))[:, :-1] 162 | path = path.unsqueeze(1).transpose(2, 3) * mask 163 | return path 164 | 165 | 166 | def clip_grad_value_(parameters, clip_value, norm_type=2): 167 | if isinstance(parameters, torch.Tensor): 168 | parameters = [parameters] 169 | parameters = list(filter(lambda p: p.grad is not None, parameters)) 170 | norm_type = float(norm_type) 171 | if clip_value is not None: 172 | clip_value = float(clip_value) 173 | 174 | total_norm = 0 175 | for p in parameters: 176 | param_norm = p.grad.data.norm(norm_type) 177 | total_norm += param_norm.item() ** norm_type 178 | if clip_value is not None: 179 | p.grad.data.clamp_(min=-clip_value, max=clip_value) 180 | total_norm = total_norm ** (1. / norm_type) 181 | return total_norm 182 | -------------------------------------------------------------------------------- /configs/config_en.yaml: -------------------------------------------------------------------------------- 1 | train: 2 | log_interval: 200 # step unit 3 | eval_interval: 400 # step unit 4 | save_interval: 50 # epoch unit: 50 for baseline / 500 for fine-tuning 5 | seed: 1234 6 | epochs: 7000 7 | learning_rate: 2e-4 8 | betas: [0.8, 0.99] 9 | eps: 1e-9 10 | batch_size: 48 11 | fp16_run: True #False 12 | lr_decay: 0.999875 13 | segment_size: 8192 14 | c_mel: 45 15 | c_kl: 1.0 16 | c_vq: 1. 17 | c_commit: 0.2 18 | c_yin: 45. 
19 | log_path: "/pits/logs" 20 | n_sample: 3 21 | alpha: 200 22 | 23 | data: 24 | data_path: "/DATA/audio/VCTK-0.92" 25 | training_files: "filelists/vctk_train_g2p.txt" 26 | validation_files: "filelists/vctk_val_g2p.txt" 27 | languages: "en_US" 28 | text_cleaners: ["english_cleaners"] 29 | sampling_rate: 22050 30 | filter_length: 1024 31 | hop_length: 256 32 | win_length: 1024 33 | n_mel_channels: 80 34 | mel_fmin: 0.0 35 | mel_fmax: null 36 | add_blank: True 37 | speakers: ["p225", "p226", "p227", "p228", "p229", "p230", "p231", "p232", "p233", "p234", "p236", "p237", "p238", "p239", "p240", "p241", "p243", "p244", "p245", "p246", "p247", "p248", "p249", "p250", "p251", "p252", "p253", "p254", "p255", "p256", "p257", "p258", "p259", "p260", "p261", "p262", "p263", "p264", "p265", "p266", "p267", "p268", "p269", "p270", "p271", "p272", "p273", "p274", "p275", "p276", "p277", "p278", "p279", "p281", "p282", "p283", "p284", "p285", "p286", "p287", "p288", "p292", "p293", "p294", "p295", "p297", "p298", "p299", "p300", "p301", "p302", "p303", "p304", "p305", "p306", "p307", "p308", "p310", "p311", "p312", "p313", "p314", "p316", "p317", "p318", "p323", "p326", "p329", "p330", "p333", "p334", "p335", "p336", "p339", "p340", "p341", "p343", "p345", "p347", "p351", "p360", "p361", "p362", "p363", "p364", "p374", "p376", "s5"] 38 | persistent_workers: True 39 | midi_start: -5 40 | midi_end: 75 41 | midis: 80 42 | ying_window: 2048 43 | ying_hop: 256 44 | tau_max: 2048 45 | octave_range: 24 46 | 47 | model: 48 | inter_channels: 192 49 | hidden_channels: 192 50 | filter_channels: 768 51 | n_heads: 2 52 | n_layers: 6 53 | kernel_size: 3 54 | p_dropout: 0.1 55 | resblock: "1" 56 | resblock_kernel_sizes: [3,7,11] 57 | resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]] 58 | upsample_rates: [8,8,2,2] 59 | upsample_initial_channel: 512 60 | upsample_kernel_sizes: [16,16,4,4] 61 | n_layers_q: 3 62 | use_spectral_norm: False 63 | gin_channels: 256 64 | codebook_size: 320 65 | yin_channels: 80 66 | yin_start: 15 # scope start bin in nansy = 1.5/8 67 | yin_scope: 50 # scope ratio in nansy = 5/8 68 | yin_shift_range: 15 # same as default start index of yingram 69 | -------------------------------------------------------------------------------- /configs/config_ja_22050.yaml: -------------------------------------------------------------------------------- 1 | train: 2 | log_interval: 200 # step unit 3 | eval_interval: 400 # step unit 4 | save_interval: 50 # epoch unit: 50 for baseline / 500 for fine-tuning 5 | seed: 1234 6 | epochs: 7000 7 | learning_rate: 2e-4 8 | betas: [0.8, 0.99] 9 | eps: 1e-9 10 | batch_size: 48 11 | fp16_run: True #False 12 | lr_decay: 0.999875 13 | segment_size: 8192 14 | c_mel: 45 15 | c_kl: 1.0 16 | c_vq: 1. 17 | c_commit: 0.2 18 | c_yin: 45. 
19 | log_path: "/logs/" 20 | n_sample: 3 21 | alpha: 200 22 | 23 | data: 24 | data_path: "./dataset/jvs_ver1/" 25 | training_files: "./filelists/jvs_train_22050.txt" 26 | validation_files: "./filelists/jvs_val_22050.txt" 27 | languages: "pyopenjtalk_prosody" 28 | text_cleaners: [] 29 | sampling_rate: 22050 30 | filter_length: 1024 31 | hop_length: 256 32 | win_length: 1024 33 | n_mel_channels: 80 34 | mel_fmin: 0.0 35 | mel_fmax: null 36 | add_blank: True 37 | speakers: ['jvs001', 'jvs002', 'jvs003', 'jvs004', 'jvs005', 'jvs006', 'jvs007', 'jvs008', 'jvs009', 'jvs010', 'jvs011', 'jvs012', 'jvs013', 'jvs014', 'jvs015', 'jvs016', 'jvs017', 'jvs018', 'jvs019', 'jvs020', 'jvs021', 'jvs022', 'jvs023', 'jvs024', 'jvs025', 'jvs026', 'jvs027', 'jvs028', 'jvs029', 'jvs030', 'jvs031', 'jvs032', 'jvs033', 'jvs034', 'jvs035', 'jvs036', 'jvs037', 'jvs038', 'jvs039', 'jvs040', 'jvs041', 'jvs042', 'jvs043', 'jvs044', 'jvs045', 'jvs046', 'jvs047', 'jvs048', 'jvs049', 'jvs050', 'jvs051', 'jvs052', 'jvs053', 'jvs054', 'jvs055', 'jvs056', 'jvs057', 'jvs058', 'jvs059', 'jvs060', 'jvs061', 'jvs062', 'jvs063', 'jvs064', 'jvs065', 'jvs066', 'jvs067', 'jvs068', 'jvs069', 'jvs070', 'jvs071', 'jvs072', 'jvs073', 'jvs074', 'jvs075', 'jvs076', 'jvs077', 'jvs078', 'jvs079', 'jvs080', 'jvs081', 'jvs082', 'jvs083', 'jvs084', 'jvs085', 'jvs086', 'jvs087', 'jvs088', 'jvs089', 'jvs090', 'jvs091', 'jvs092', 'jvs093', 'jvs094', 'jvs095', 'jvs096', 'jvs097', 'jvs098', 'jvs099', 'jvs100'] 38 | persistent_workers: True 39 | midi_start: -5 40 | midi_end: 75 41 | midis: 80 42 | ying_window: 2048 43 | ying_hop: 256 44 | tau_max: 2048 45 | octave_range: 24 46 | 47 | model: 48 | inter_channels: 192 49 | hidden_channels: 192 50 | filter_channels: 768 51 | n_heads: 2 52 | n_layers: 6 53 | kernel_size: 3 54 | p_dropout: 0.1 55 | resblock: "1" 56 | resblock_kernel_sizes: [3,7,11] 57 | resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]] 58 | upsample_rates: [8,8,2,2] 59 | upsample_initial_channel: 512 60 | upsample_kernel_sizes: [16,16,4,4] 61 | n_layers_q: 3 62 | use_spectral_norm: False 63 | gin_channels: 256 64 | codebook_size: 320 65 | yin_channels: 80 66 | yin_start: 15 # scope start bin in nansy = 1.5/8 67 | yin_scope: 50 # scope ratio in nansy = 5/8 68 | yin_shift_range: 15 # same as default start index of yingram 69 | -------------------------------------------------------------------------------- /configs/config_ja_44100.yaml: -------------------------------------------------------------------------------- 1 | train: 2 | log_interval: 200 # step unit 3 | eval_interval: 400 # step unit 4 | save_interval: 10 # epoch unit: 50 for baseline / 500 for fine-tuning 5 | seed: 1234 6 | epochs: 7000 7 | learning_rate: 2e-4 8 | betas: [0.8, 0.99] 9 | eps: 1e-9 10 | batch_size: 32 11 | fp16_run: True #False 12 | lr_decay: 0.999875 13 | segment_size: 16384 #8192 14 | c_mel: 45 15 | c_kl: 1.0 16 | c_vq: 1. 17 | c_commit: 0.2 18 | c_yin: 45. 
19 | log_path: "/logs/" 20 | n_sample: 3 21 | alpha: 200 22 | 23 | data: 24 | data_path: "./dataset/jvs_ver1/" 25 | training_files: "./filelists/jvs_train_44100.txt" 26 | validation_files: "./filelists/jvs_val_44100.txt" 27 | languages: "pyopenjtalk_prosody" 28 | text_cleaners: [] 29 | sampling_rate: 44100 #22050 30 | filter_length: 2048 #1024 31 | hop_length: 512 #256 32 | win_length: 2048 #1024 33 | n_mel_channels: 80 34 | mel_fmin: 0.0 35 | mel_fmax: null 36 | add_blank: True 37 | speakers: ['jvs001', 'jvs002', 'jvs003', 'jvs004', 'jvs005', 'jvs006', 'jvs007', 'jvs008', 'jvs009', 'jvs010', 'jvs011', 'jvs012', 'jvs013', 'jvs014', 'jvs015', 'jvs016', 'jvs017', 'jvs018', 'jvs019', 'jvs020', 'jvs021', 'jvs022', 'jvs023', 'jvs024', 'jvs025', 'jvs026', 'jvs027', 'jvs028', 'jvs029', 'jvs030', 'jvs031', 'jvs032', 'jvs033', 'jvs034', 'jvs035', 'jvs036', 'jvs037', 'jvs038', 'jvs039', 'jvs040', 'jvs041', 'jvs042', 'jvs043', 'jvs044', 'jvs045', 'jvs046', 'jvs047', 'jvs048', 'jvs049', 'jvs050', 'jvs051', 'jvs052', 'jvs053', 'jvs054', 'jvs055', 'jvs056', 'jvs057', 'jvs058', 'jvs059', 'jvs060', 'jvs061', 'jvs062', 'jvs063', 'jvs064', 'jvs065', 'jvs066', 'jvs067', 'jvs068', 'jvs069', 'jvs070', 'jvs071', 'jvs072', 'jvs073', 'jvs074', 'jvs075', 'jvs076', 'jvs077', 'jvs078', 'jvs079', 'jvs080', 'jvs081', 'jvs082', 'jvs083', 'jvs084', 'jvs085', 'jvs086', 'jvs087', 'jvs088', 'jvs089', 'jvs090', 'jvs091', 'jvs092', 'jvs093', 'jvs094', 'jvs095', 'jvs096', 'jvs097', 'jvs098', 'jvs099', 'jvs100'] 38 | persistent_workers: True 39 | midi_start: -5 40 | midi_end: 75 41 | midis: 80 42 | ying_window: 4096 #2048 =tau_max 43 | ying_hop: 512 #256 44 | tau_max: 4096 #2048 diff maximum period 45 | octave_range: 24 46 | 47 | model: 48 | inter_channels: 192 49 | hidden_channels: 192 50 | filter_channels: 768 51 | n_heads: 2 52 | n_layers: 6 53 | kernel_size: 3 54 | p_dropout: 0.1 55 | resblock: "1" 56 | resblock_kernel_sizes: [3,7,11] 57 | resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]] 58 | upsample_rates: [8,8,2,2,2] #[8,8,2,2] 59 | upsample_initial_channel: 512 60 | upsample_kernel_sizes: [16,16,4,4,4] #[16,16,4,4] 61 | n_layers_q: 3 62 | use_spectral_norm: False 63 | gin_channels: 256 64 | codebook_size: 320 65 | yin_channels: 80 66 | yin_start: 15 # scope start bin in nansy = 1.5/8 67 | yin_scope: 50 # scope ratio in nansy = 5/8 68 | yin_shift_range: 15 # same as default start index of yingram 69 | -------------------------------------------------------------------------------- /data_utils.py: -------------------------------------------------------------------------------- 1 | # modified from https://github.com/jaywalnut310/vits 2 | import os 3 | import random 4 | import torch 5 | import torch.utils.data 6 | 7 | import commons 8 | from mel_processing import spectrogram_torch 9 | from utils import load_wav_to_torch, load_filepaths_and_text 10 | from text import text_to_sequence 11 | from analysis import Pitch 12 | """ Modified from Multi speaker version of VITS""" 13 | 14 | 15 | class TextAudioSpeakerLoader(torch.utils.data.Dataset): 16 | """ 17 | 1) loads audio, speaker_id, text pairs 18 | 2) normalizes text and converts them to sequences of integers 19 | 3) computes spectrograms from audio files. 
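        4) computes Yingram pitch features from the audio via analysis.Pitch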
20 | """ 21 | 22 | def __init__(self, audiopaths_sid_text, hparams, pt_run=False): 23 | self.audiopaths_sid_text = load_filepaths_and_text(audiopaths_sid_text) 24 | self.text_cleaners = hparams.text_cleaners 25 | self.sampling_rate = hparams.sampling_rate 26 | self.filter_length = hparams.filter_length 27 | self.hop_length = hparams.hop_length 28 | self.win_length = hparams.win_length 29 | 30 | self.lang = hparams.languages 31 | 32 | self.add_blank = hparams.add_blank 33 | self.min_text_len = getattr(hparams, "min_text_len", 1) 34 | self.max_text_len = getattr(hparams, "max_text_len", 190) 35 | 36 | self.speaker_dict = { 37 | speaker: idx 38 | for idx, speaker in enumerate(hparams.speakers) 39 | } 40 | self.data_path = hparams.data_path 41 | 42 | self.pitch = Pitch(sr=hparams.sampling_rate, 43 | ### modify ############## 44 | #W=hparams.tau_max, 45 | W=hparams.ying_window, 46 | w_step=hparams.ying_hop, 47 | ######################### 48 | tau_max=hparams.tau_max, 49 | midi_start=hparams.midi_start, 50 | midi_end=hparams.midi_end, 51 | octave_range=hparams.octave_range) 52 | 53 | random.seed(1234) 54 | random.shuffle(self.audiopaths_sid_text) 55 | self._filter() 56 | if pt_run: 57 | for _audiopaths_sid_text in self.audiopaths_sid_text: 58 | _ = self.get_audio_text_speaker_pair(_audiopaths_sid_text, 59 | True) 60 | 61 | def _filter(self): 62 | """ 63 | Filter text & store spec lengths 64 | """ 65 | # Store spectrogram lengths for Bucketing 66 | # wav_length ~= file_size / (wav_channels * Bytes per dim) = file_size / (1 * 2) 67 | # spec_length = wav_length // hop_length 68 | 69 | audiopaths_sid_text_new = [] 70 | lengths = [] 71 | for audiopath, text, spk in self.audiopaths_sid_text: 72 | audiopath = audiopath.replace("\\", "/") 73 | if self.min_text_len <= len(text) and len( 74 | text) <= self.max_text_len: 75 | audiopath = os.path.join(self.data_path, audiopath) 76 | audiopaths_sid_text_new.append([audiopath, text, spk]) 77 | lengths.append( 78 | os.path.getsize(audiopath) // (2 * self.hop_length)) 79 | self.audiopaths_sid_text = audiopaths_sid_text_new 80 | self.lengths = lengths 81 | 82 | def get_audio_text_speaker_pair(self, audiopath_sid_text, pt_run=False): 83 | # separate filename, speaker_id and text 84 | audiopath, text, spk = audiopath_sid_text[0], audiopath_sid_text[ 85 | 1], audiopath_sid_text[2] 86 | text, tone = self.get_text(text) 87 | spec, ying, wav = self.get_audio(audiopath, pt_run) 88 | sid = self.get_sid(self.speaker_dict[spk]) 89 | return (text, spec, ying, wav, sid, tone) 90 | 91 | def get_audio(self, filename, pt_run=False): 92 | audio, sampling_rate = load_wav_to_torch(filename) 93 | if sampling_rate != self.sampling_rate: 94 | raise ValueError("{} {} SR doesn't match target {} SR".format( 95 | sampling_rate, self.sampling_rate)) 96 | audio_norm = audio.unsqueeze(0) 97 | spec_filename = filename.replace(".wav", ".spec.pt") 98 | ying_filename = filename.replace(".wav", ".ying.pt") 99 | if os.path.exists(spec_filename) and not pt_run: 100 | spec = torch.load(spec_filename, map_location='cpu') 101 | else: 102 | spec = spectrogram_torch(audio_norm, 103 | self.filter_length, 104 | self.sampling_rate, 105 | self.hop_length, 106 | self.win_length, 107 | center=False) 108 | spec = torch.squeeze(spec, 0) 109 | torch.save(spec, spec_filename) 110 | if os.path.exists(ying_filename) and not pt_run: 111 | ying = torch.load(ying_filename, map_location='cpu') 112 | else: 113 | wav = torch.nn.functional.pad( 114 | audio_norm.unsqueeze(0), 115 | (self.filter_length - self.hop_length, 
116 | self.filter_length - self.hop_length + 117 | (-audio_norm.shape[1]) % self.hop_length + self.hop_length * (audio_norm.shape[1] % self.hop_length == 0)), 118 | mode='constant').squeeze(0) 119 | ying = self.pitch.yingram(wav)[0] 120 | torch.save(ying, ying_filename) 121 | return spec, ying, audio_norm 122 | 123 | def get_text(self, text): 124 | text_norm, tone = text_to_sequence(text, self.lang) 125 | if self.add_blank: 126 | text_norm = commons.intersperse(text_norm, 0) 127 | tone = commons.intersperse(tone, 0) 128 | text_norm = torch.LongTensor(text_norm) 129 | tone = torch.LongTensor(tone) 130 | return text_norm, tone 131 | 132 | def get_sid(self, sid): 133 | sid = torch.LongTensor([int(sid)]) 134 | return sid 135 | 136 | def __getitem__(self, index): 137 | return self.get_audio_text_speaker_pair( 138 | self.audiopaths_sid_text[index]) 139 | 140 | def __len__(self): 141 | return len(self.audiopaths_sid_text) 142 | 143 | 144 | class TextAudioSpeakerCollate(): 145 | """ Zero-pads model inputs and targets""" 146 | 147 | def __init__(self, return_ids=False): 148 | self.return_ids = return_ids 149 | 150 | def __call__(self, batch): 151 | """Collate's training batch from normalized text, audio and speaker identities 152 | PARAMS 153 | ------ 154 | batch: [text_normalized, spec_normalized, wav_normalized, sid] 155 | """ 156 | # Right zero-pad all one-hot text sequences to max input length 157 | _, ids_sorted_decreasing = torch.sort(torch.LongTensor( 158 | [x[1].size(1) for x in batch]), 159 | dim=0, 160 | descending=True) 161 | 162 | max_text_len = max([len(x[0]) for x in batch]) 163 | max_spec_len = max([x[1].size(1) for x in batch]) 164 | max_ying_len = max([x[2].size(1) for x in batch]) 165 | max_wav_len = max([x[3].size(1) for x in batch]) 166 | 167 | text_lengths = torch.LongTensor(len(batch)) 168 | spec_lengths = torch.LongTensor(len(batch)) 169 | ying_lengths = torch.LongTensor(len(batch)) 170 | wav_lengths = torch.LongTensor(len(batch)) 171 | sid = torch.LongTensor(len(batch)) 172 | 173 | text_padded = torch.LongTensor(len(batch), max_text_len) 174 | tone_padded = torch.LongTensor(len(batch), max_text_len) 175 | spec_padded = torch.FloatTensor(len(batch), batch[0][1].size(0), 176 | max_spec_len) 177 | ying_padded = torch.FloatTensor(len(batch), batch[0][2].size(0), 178 | max_ying_len) 179 | wav_padded = torch.FloatTensor(len(batch), 1, max_wav_len) 180 | text_padded.zero_() 181 | tone_padded.zero_() 182 | spec_padded.zero_() 183 | ying_padded.zero_() 184 | wav_padded.zero_() 185 | for i in range(len(ids_sorted_decreasing)): 186 | row = batch[ids_sorted_decreasing[i]] 187 | 188 | text = row[0] 189 | text_padded[i, :text.size(0)] = text 190 | text_lengths[i] = text.size(0) 191 | 192 | spec = row[1] 193 | spec_padded[i, :, :spec.size(1)] = spec 194 | spec_lengths[i] = spec.size(1) 195 | 196 | ying = row[2] 197 | ying_padded[i, :, :ying.size(1)] = ying 198 | ying_lengths[i] = ying.size(1) 199 | 200 | wav = row[3] 201 | wav_padded[i, :, :wav.size(1)] = wav 202 | wav_lengths[i] = wav.size(1) 203 | 204 | tone = row[5] 205 | tone_padded[i, :text.size(0)] = tone 206 | 207 | sid[i] = row[4] 208 | 209 | if self.return_ids: 210 | return text_padded, text_lengths, spec_padded, spec_lengths, wav_padded, wav_lengths, sid, ids_sorted_decreasing 211 | return text_padded, text_lengths, spec_padded, spec_lengths, ying_padded, ying_lengths, wav_padded, wav_lengths, sid, tone_padded 212 | 213 | 214 | class DistributedBucketSampler(torch.utils.data.distributed.DistributedSampler 215 | ): 216 | """ 217 | 
Maintain similar input lengths in a batch. 218 | Length groups are specified by boundaries. 219 | Ex) boundaries = [b1, b2, b3] -> any batch is included either {x | b1 < length(x) <=b2} or {x | b2 < length(x) <= b3}. 220 | 221 | It removes samples which are not included in the boundaries. 222 | Ex) boundaries = [b1, b2, b3] -> any x s.t. length(x) <= b1 or length(x) > b3 are discarded. 223 | """ 224 | 225 | def __init__(self, 226 | dataset, 227 | batch_size, 228 | boundaries, 229 | num_replicas=None, 230 | rank=None, 231 | shuffle=True): 232 | super().__init__(dataset, 233 | num_replicas=num_replicas, 234 | rank=rank, 235 | shuffle=shuffle) 236 | self.lengths = dataset.lengths 237 | self.batch_size = batch_size 238 | self.boundaries = boundaries 239 | 240 | self.buckets, self.num_samples_per_bucket = self._create_buckets() 241 | self.total_size = sum(self.num_samples_per_bucket) 242 | self.num_samples = self.total_size // self.num_replicas 243 | 244 | def _create_buckets(self): 245 | buckets = [[] for _ in range(len(self.boundaries) - 1)] 246 | for i in range(len(self.lengths)): 247 | length = self.lengths[i] 248 | idx_bucket = self._bisect(length) 249 | if idx_bucket != -1: 250 | buckets[idx_bucket].append(i) 251 | 252 | for i in range(len(buckets) - 1, -1, -1): 253 | if len(buckets[i]) == 0: 254 | buckets.pop(i) 255 | self.boundaries.pop(i + 1) 256 | 257 | num_samples_per_bucket = [] 258 | for i in range(len(buckets)): 259 | len_bucket = len(buckets[i]) 260 | total_batch_size = self.num_replicas * self.batch_size 261 | rem = (total_batch_size - 262 | (len_bucket % total_batch_size)) % total_batch_size 263 | num_samples_per_bucket.append(len_bucket + rem) 264 | return buckets, num_samples_per_bucket 265 | 266 | def __iter__(self): 267 | # deterministically shuffle based on epoch 268 | g = torch.Generator() 269 | g.manual_seed(self.epoch) 270 | 271 | indices = [] 272 | if self.shuffle: 273 | for bucket in self.buckets: 274 | indices.append( 275 | torch.randperm(len(bucket), generator=g).tolist()) 276 | else: 277 | for bucket in self.buckets: 278 | indices.append(list(range(len(bucket)))) 279 | 280 | batches = [] 281 | for i in range(len(self.buckets)): 282 | bucket = self.buckets[i] 283 | len_bucket = len(bucket) 284 | ids_bucket = indices[i] 285 | num_samples_bucket = self.num_samples_per_bucket[i] 286 | 287 | # add extra samples to make it evenly divisible 288 | rem = num_samples_bucket - len_bucket 289 | ids_bucket = ids_bucket + ids_bucket * \ 290 | (rem // len_bucket) + ids_bucket[:(rem % len_bucket)] 291 | 292 | # subsample 293 | ids_bucket = ids_bucket[self.rank::self.num_replicas] 294 | 295 | # batching 296 | for j in range(len(ids_bucket) // self.batch_size): 297 | batch = [ 298 | bucket[idx] 299 | for idx in ids_bucket[j * self.batch_size:(j + 1) * 300 | self.batch_size] 301 | ] 302 | batches.append(batch) 303 | 304 | if self.shuffle: 305 | batch_ids = torch.randperm(len(batches), generator=g).tolist() 306 | batches = [batches[i] for i in batch_ids] 307 | self.batches = batches 308 | 309 | assert len(self.batches) * self.batch_size == self.num_samples 310 | return iter(self.batches) 311 | 312 | def _bisect(self, x, lo=0, hi=None): 313 | if hi is None: 314 | hi = len(self.boundaries) - 1 315 | 316 | if hi > lo: 317 | mid = (hi + lo) // 2 318 | if self.boundaries[mid] < x and x <= self.boundaries[mid + 1]: 319 | return mid 320 | elif x <= self.boundaries[mid]: 321 | return self._bisect(x, lo, mid) 322 | else: 323 | return self._bisect(x, mid + 1, hi) 324 | else: 325 | return -1 326 
| 327 | def __len__(self): 328 | return self.num_samples // self.batch_size 329 | 330 | ### add ### 331 | import copy 332 | from tqdm import tqdm 333 | ########### 334 | def create_spec(audiopaths_sid_text, hparams): 335 | audiopaths_sid_text_list = load_filepaths_and_text(audiopaths_sid_text) 336 | for audiopath, _, _ in tqdm(audiopaths_sid_text_list): ### add tqdm### 337 | 338 | ### modify ### 339 | audio_path = copy.deepcopy(audiopath) 340 | audiopath = os.path.join(hparams.data_path, audiopath) 341 | try: 342 | audio, sampling_rate = load_wav_to_torch(audiopath) 343 | except: 344 | with open(audiopaths_sid_text, mode="r", encoding="utf-8") as f: 345 | lines = f.readlines() 346 | for idx, line in enumerate(lines): 347 | check_path, sentence, spk_id = line.split("|") 348 | if check_path == audio_path: 349 | remove_path = lines.pop(idx) 350 | with open(audiopaths_sid_text, mode="w", encoding="utf-8") as f: 351 | f.writelines(lines) 352 | print(f"File is not found!!! path={remove_path} ==> Removed.") 353 | continue 354 | ############## 355 | 356 | if sampling_rate != hparams.sampling_rate: 357 | raise ValueError("{} SR doesn't match target {} SR".format( 358 | sampling_rate, hparams.sampling_rate)) 359 | audio_norm = audio.unsqueeze(0) 360 | specpath = audiopath.replace(".wav", ".spec.pt") 361 | 362 | if not os.path.exists(specpath): 363 | spec = spectrogram_torch(audio_norm, 364 | hparams.filter_length, 365 | hparams.sampling_rate, 366 | hparams.hop_length, 367 | hparams.win_length, 368 | center=False) 369 | spec = torch.squeeze(spec, 0) 370 | torch.save(spec, specpath) 371 | 372 | 373 | ### add ### 374 | def dataset_check(audiopaths_sid_text, hparams): 375 | audiopaths_sid_text_list = load_filepaths_and_text(audiopaths_sid_text) 376 | for audiopath, _, _ in tqdm(audiopaths_sid_text_list): ### add tqdm### 377 | 378 | ### modify ### 379 | audio_path = copy.deepcopy(audiopath) 380 | audiopath = os.path.join(hparams.data_path, audiopath) 381 | try: 382 | audio, sampling_rate = load_wav_to_torch(audiopath) 383 | except: 384 | with open(audiopaths_sid_text, mode="r", encoding="utf-8") as f: 385 | lines = f.readlines() 386 | for idx, line in enumerate(lines): 387 | check_path, sentence, spk_id = line.split("|") 388 | if check_path == audio_path: 389 | remove_path = lines.pop(idx) 390 | with open(audiopaths_sid_text, mode="w", encoding="utf-8") as f: 391 | f.writelines(lines) 392 | print(f"File is not found!!! 
path={remove_path} ==> Removed.") 393 | continue 394 | ############## 395 | 396 | if sampling_rate != hparams.sampling_rate: 397 | raise ValueError("{} SR doesn't match target {} SR".format( 398 | sampling_rate, hparams.sampling_rate)) 399 | 400 | 401 | ### add ### 402 | def infer_text_process(text, hps): 403 | 404 | lang = hps.data.languages 405 | add_blank = hps.data.add_blank 406 | 407 | # get_text function 408 | text_norm, tone = text_to_sequence(text, lang) 409 | if add_blank: 410 | text_norm = commons.intersperse(text_norm, 0) 411 | tone = commons.intersperse(tone, 0) 412 | 413 | text_length = torch.tensor(len(text_norm), dtype=torch.int64) 414 | text_length = torch.unsqueeze(text_length, dim=0) 415 | text_norm = torch.unsqueeze(torch.LongTensor(text_norm), dim=0) 416 | tone = torch.unsqueeze(torch.LongTensor(tone), dim=0) 417 | 418 | 419 | return text_norm, tone, text_length 420 | ########### -------------------------------------------------------------------------------- /dataset/preprocess.py: -------------------------------------------------------------------------------- 1 | import librosa 2 | import os 3 | import soundfile 4 | from tqdm import tqdm 5 | import random 6 | import argparse 7 | 8 | def main(dataset_dir:str = "./jvs_ver1/", target_sr:int = 22050): 9 | 10 | use_wav_folder = ["parallel100", "nonpara30"] #, "whisper10","falset10"] 11 | text_list = list() 12 | 13 | file_count = 0 14 | sentence_count = 0 15 | for spk_idx in range(100): 16 | spk_idx += 1 17 | spk_name = "jvs" + str(spk_idx).zfill(3) 18 | for folder in use_wav_folder: 19 | target_folder = os.path.join(dataset_dir, spk_name, folder, "wav24kHz16bit") 20 | results_folder = os.path.join(dataset_dir, spk_name, folder, f"wav{target_sr}Hz16bit") 21 | os.makedirs(results_folder, exist_ok=True) 22 | 23 | for filename in tqdm(os.listdir(target_folder), desc=spk_name): 24 | wav_path = os.path.join(target_folder, filename) 25 | y, sr = librosa.load(wav_path, sr=24000) 26 | y_converted = librosa.resample(y, orig_sr=24000, target_sr=target_sr) 27 | save_path = os.path.join(results_folder, filename) 28 | soundfile.write(save_path, y_converted, target_sr) 29 | 30 | txt_path = os.path.join(dataset_dir, spk_name, folder, "transcripts_utf8.txt") 31 | for txt in read_txt(txt_path): 32 | if txt == "\n": 33 | continue 34 | name, sentence = txt.split(":") 35 | sentence = sentence.replace("\n", "") 36 | wav_filepath = os.path.join(spk_name, folder, f"wav{target_sr}Hz16bit", name+".wav") 37 | 38 | out_txt = wav_filepath + "|" + sentence + "|" + spk_name + "\n" 39 | text_list.append(out_txt) 40 | 41 | max_n = len(text_list) 42 | test_list = list() 43 | for _ in range(int(max_n * 0.005)): 44 | n = len(text_list) 45 | idx = random.randint(9, int(n-1)) 46 | txt = text_list.pop(idx) 47 | test_list.append(txt) 48 | 49 | 50 | max_n = len(text_list) 51 | val_list = list() 52 | for _ in range(int(max_n * 0.005)): 53 | n = len(text_list) 54 | idx = random.randint(9, int(n-1)) 55 | txt = text_list.pop(idx) 56 | val_list.append(txt) 57 | 58 | write_txt(f"./filelists/jvs_train_{target_sr}.txt", text_list) 59 | write_txt(f"./filelists/jvs_val_{target_sr}.txt", val_list) 60 | write_txt(f"./filelists/jvs_test_{target_sr}.txt", test_list) 61 | 62 | 63 | return 0 64 | 65 | def read_txt(path): 66 | with open(path, mode="r", encoding="utf-8") as f: 67 | lines = f.readlines() 68 | return lines 69 | 70 | def write_txt(path, lines): 71 | with open(path, mode="w", encoding="utf-8") as f: 72 | f.writelines(lines) 73 | 74 
| 75 | if __name__ == "__main__": 76 | 77 | parser = argparse.ArgumentParser() 78 | 79 | parser.add_argument('--folder_path', 80 | type=str, 81 | required=True, 82 | help='Path to jvs corpus folder') 83 | parser.add_argument('--sampling_rate', 84 | type=str, 85 | required=True, 86 | help='Target sampling rate') 87 | 88 | args = parser.parse_args() 89 | 90 | main(dataset_dir=args.folder_path, target_sr=int(args.sampling_rate)) -------------------------------------------------------------------------------- /filelists/vctk_val_g2p.txt: -------------------------------------------------------------------------------- 1 | wav22_silence_trimmed_wav/p225/p225_357_mic1.wav|{IH T} {IH Z} {AH} {S AY N} {AH V} {HH OW P}.|p225 2 | wav22_silence_trimmed_wav/p225/p225_358_mic1.wav|{DH AH} {K AH M IH T M AH N T} {W AA Z} {N AA T} {L AO NG}, {AE N D} {IH T} {W AA Z} {W ER TH} {DH AH} {R IH S K}.|p225 3 | wav22_silence_trimmed_wav/p225/p225_359_mic1.wav|{B AH T} {W IY} {W EH L K AH M} {DH IH S} {D AA K Y AH M EH N T}.|p225 4 | wav22_silence_trimmed_wav/p226/p226_365_mic1.wav|{AY} {HH AE V} {G AA T} {AH} {W AY F} {T UW} {F IY D}.|p226 5 | wav22_silence_trimmed_wav/p226/p226_366_mic1.wav|{IH T S} {N AA T} {P R IH T IY}, {B AH T} {IH T S} {IH F EH K T IH V}.|p226 6 | wav22_silence_trimmed_wav/p226/p226_367_mic1.wav|{EH K W AH T IY} {D IH K L AY N D} {T UW} {K AA M EH N T}.|p226 7 | wav22_silence_trimmed_wav/p227/p227_397_mic1.wav|{DH EY} {HH AE V} {N AA T} {G AA T} {EH N IY W AH N}.|p227 8 | wav22_silence_trimmed_wav/p227/p227_398_mic1.wav|{AY} {K AE N} {HH AA R D L IY} {B IH L IY V} {IH T}.|p227 9 | wav22_silence_trimmed_wav/p227/p227_399_mic1.wav|{HH AW} {D UW} {Y UW} {T EY K} {DH EH M} {AH W EY}?|p227 10 | wav22_silence_trimmed_wav/p228/p228_366_mic1.wav|{B AH T} {AY} {M AE N AH JH D}.|p228 11 | wav22_silence_trimmed_wav/p228/p228_367_mic1.wav|{B AY} {DH AE T} {T AY M}, {HH AW EH V ER}, {IH T} {W AA Z} {AO L R EH D IY} {T UW} {L EY T}.|p228 12 | wav22_silence_trimmed_wav/p228/p228_368_mic1.wav|{AY} {AE M} {N AA T} {W IH L IH NG} {T UW} {S EY} {EH N IY TH IH NG} {AH B AW T} {EH N IY} {K AH P AH L}.|p228 13 | wav22_silence_trimmed_wav/p229/p229_387_mic1.wav|{DH AE T} {T IY M} {IH Z} {D UW} {T UW} {B IY} {AH N AW N S T} {DH IH S} {M AO R N IH NG}.|p229 14 | wav22_silence_trimmed_wav/p229/p229_388_mic1.wav|{IH T} {HH AE Z} {B IH N} {AH} {L AH V L IY} {F AE M AH L IY} {AH K EY ZH AH N}.|p229 15 | wav22_silence_trimmed_wav/p229/p229_389_mic1.wav|{DH AE T S} {DH AH} {N EY M} {AH V} {DH AH} {G EY M}, {DH OW}.|p229 16 | wav22_silence_trimmed_wav/p230/p230_411_mic1.wav|{T R IY T M AH N T} {IH Z} {N AA T} {AH N} {IH SH UW} {W IH DH} {DH IY Z} {P IY P AH L}.|p230 17 | wav22_silence_trimmed_wav/p230/p230_413_mic1.wav|{IY V IH N} {IH F} {DH EY} {K AH M} {AW T} {P L EY IH NG} {AH} {F IH Z IH K AH L} {G EY M}, {W IY} {K AE N} {K OW P}.|p230 18 | wav22_silence_trimmed_wav/p230/p230_414_mic1.wav|{IH T} {IH Z} {DH AH} {OW L D} {S T AO R IY}.|p230 19 | wav22_silence_trimmed_wav/p231/p231_471_mic1.wav|{D EH N IH S} {W AA Z} {N AA T} {S OW} {SH UH R}.|p231 20 | wav22_silence_trimmed_wav/p231/p231_472_mic1.wav|{HH IY} {F EH L T} {IH T} {W AA Z} {DH AH} {R AY T} {T AY M}.|p231 21 | wav22_silence_trimmed_wav/p231/p231_473_mic1.wav|{IH T} {W AA Z} {K L IH R} {AA N} {TH ER Z D EY}.|p231 22 | wav22_silence_trimmed_wav/p232/p232_410_mic1.wav|{AY} {W AA Z} {IH N} {AH} {P AH Z IH SH AH N} {T UW} {CH AE L AH N JH} {F AO R} {DH IH S} {IH V EH N T} {AE N D} {D IH D AH N T}.|p232 23 | wav22_silence_trimmed_wav/p232/p232_411_mic1.wav|{Y UW} 
{W AO N T IH D} {DH AH} {EH V AH D AH N S}.|p232 24 | wav22_silence_trimmed_wav/p232/p232_412_mic1.wav|{S IY N Y ER} {M AE N AH JH M AH N T} {IH N} {S K AA T L AH N D} {TH R UW} {IH T S} {W EY T} {B IH HH AY N D} {DH AH} {AO R K AH S T R AH}.|p232 25 | wav22_silence_trimmed_wav/p233/p233_387_mic1.wav|{IH T} {W AA Z} {AH} {B R EH TH T EY K IH NG} {M OW M AH N T}.|p233 26 | wav22_silence_trimmed_wav/p233/p233_388_mic1.wav|{AY} {W AA Z} {IH N} {JH EY L} {F AO R} {F AY V} {Y IH R Z}.|p233 27 | wav22_silence_trimmed_wav/p233/p233_389_mic1.wav|{W AH T} {HH AE P AH N D} {IH N} {DH AH} {S AH M ER}?|p233 28 | wav22_silence_trimmed_wav/p234/p234_356_mic1.wav|{DH AE T} {IH Z} {DH AH} {L EH S AH N} {AH V} {DH AH} {L AE S T} {TH R IY} {W IY K S}.|p234 29 | wav22_silence_trimmed_wav/p234/p234_357_mic1.wav|{DH AH} {S IH T IY} {W EH L K AH M D} {DH AH} {N AH M B ER Z}.|p234 30 | wav22_silence_trimmed_wav/p234/p234_358_mic1.wav|{S AH M TH IH NG} {IH Z} {G OW IH NG} {T UW} {HH AE V} {T UW} {G IH V}.|p234 31 | wav22_silence_trimmed_wav/p236/p236_498_mic1.wav|{AY} {W AA Z} {R EH D IY} {F AO R} {DH IH S}.|p236 32 | wav22_silence_trimmed_wav/p236/p236_499_mic1.wav|{IH N} {F AE K T}, {W IY} {W ER} {AO L} {OW V ER} {DH AH} {SH AA P}.|p236 33 | wav22_silence_trimmed_wav/p236/p236_500_mic1.wav|{AH} {S K AA T S M AH N} {HH AE Z} {T UW} {D IH F EH N D} {HH IH Z} {K AE S AH L}.|p236 34 | wav22_silence_trimmed_wav/p237/p237_346_mic1.wav|{DH AH} {AH N AW N S M EH N T} {W AA Z} {M EY D} {AE F T ER} {IH N K W AY ER IY Z} {F R AH M} {AH} {N AE SH AH N AH L} {N UW Z P EY P ER}.|p237 35 | wav22_silence_trimmed_wav/p237/p237_347_mic1.wav|{IH N D IY D}, {IH T} {W AA Z} {M AE JH IH K AH L}.|p237 36 | wav22_silence_trimmed_wav/p237/p237_348_mic1.wav|{L UH K} {AE T} {DH AH} {W IH T N AH S AH Z}.|p237 37 | wav22_silence_trimmed_wav/p238/p238_454_mic1.wav|{S IH M AH L ER} {M EH ZH ER Z} {AA R} {IH K S P EH K T AH D} {IH N} {IH NG G L AH N D} {AE N D} {W EY L Z}.|p238 38 | wav22_silence_trimmed_wav/p238/p238_455_mic1.wav|{S T EY JH K OW CH} {IH Z} {S IY K IH NG} {AH} {N UW} {F AH N AE N S} {D ER EH K T ER}.|p238 39 | wav22_silence_trimmed_wav/p238/p238_456_mic1.wav|{N AW} {IH T} {HH AE Z} {D AH B AH L D} {IH N} {S AY Z}.|p238 40 | wav22_silence_trimmed_wav/p239/p239_498_mic1.wav|{S AE T ER D IY Z} {M AE CH} {W AA Z} {F EH R L IY} {S T R EY T F AO R W ER D}.|p239 41 | wav22_silence_trimmed_wav/p239/p239_499_mic1.wav|{AY} {AE M} {N AA T} {AH N D UW L IY} {S ER P R AY Z D}.|p239 42 | wav22_silence_trimmed_wav/p239/p239_500_mic1.wav|{AY AH N} {D AH NG K AH N} {S M IH TH} {IH Z} {R AO NG}.|p239 43 | wav22_silence_trimmed_wav/p240/p240_375_mic1.wav|{W IY} {AA R} {L UH K IH NG} {F AO R} {P ER F EH K SH AH N}.|p240 44 | wav22_silence_trimmed_wav/p240/p240_376_mic1.wav|{AA N} {F Y UW AH L}, {DH AH} {CH AE N S AH L ER} {HH AE Z} {AH} {N AH M B ER} {AH V} {AA P SH AH N Z}.|p240 45 | wav22_silence_trimmed_wav/p240/p240_377_mic1.wav|{HH IY} {L EY T ER} {B IH K EY M} {AH} {R IH S P EH K T IH D} {HH AY} {K AO R T} {JH AH JH}.|p240 46 | wav22_silence_trimmed_wav/p241/p241_369_mic1.wav|{DH EH R} {W AA Z} {N OW} {HH IH N T} {AH V} {S K AE N D AH L}.|p241 47 | wav22_silence_trimmed_wav/p241/p241_370_mic1.wav|{S EY F T IY} {W AA Z} {AO L S OW} {AH N} {IH SH UW}.|p241 48 | wav22_silence_trimmed_wav/p241/p241_371_mic1.wav|{W EH L}, {DH AH} {B IH G} {D EY} {HH AE Z} {ER AY V D}.|p241 49 | wav22_silence_trimmed_wav/p243/p243_393_mic1.wav|{B AH T} {S UW N} {DH AH} {G EY M} {D R AA P T} {B AE K} {IH N T UW} {IH T S} {OW L D} {W EY Z}.|p243 50 | 
wav22_silence_trimmed_wav/p243/p243_394_mic1.wav|{IH N} {M AY} {AH P IH N Y AH N}, {EH N IY} {T R AE N S F ER} {T UW} {AH} {F AO R AH N} {K L AH B} {W IH L} {HH EH L P} {HH IH M} {P R AA G R AH S}.|p243 51 | wav22_silence_trimmed_wav/p243/p243_395_mic1.wav|{IH N} {DH AE T} {S IH CH UW EY SH AH N}, {DH EY} {TH AO T} {W IY} {SH UH D} {B IY} {B IY T AH N}.|p243 52 | wav22_silence_trimmed_wav/p244/p244_419_mic1.wav|{B IH L D IH NG} {AH N} {AH N D ER G R AW N D}, {F AO R} {IH G Z AE M P AH L}, {HH AE Z} {P R UW V D} {AH} {N AY T M EH R}.|p244 53 | wav22_silence_trimmed_wav/p244/p244_420_mic1.wav|{DH AE T} {M AY T} {M IY N} {AH N AH DH ER} {D IH L EY}.|p244 54 | wav22_silence_trimmed_wav/p244/p244_421_mic1.wav|{W AH N} {D EY}, {HH IY} {TH R UW} {DH AH} {B EY B IY} {AH G EH N S T} {AH} {W AO L}.|p244 55 | wav22_silence_trimmed_wav/p245/p245_354_mic1.wav|{IH T} {W AA Z} {V EH R IY} {F AO R M AH L}.|p245 56 | wav22_silence_trimmed_wav/p245/p245_355_mic1.wav|{DH AH} {P R EH SH ER} {IH Z} {AA N} {S EH L T IH K} {IH N} {DH AH} {M EY N}.|p245 57 | wav22_silence_trimmed_wav/p245/p245_356_mic1.wav|{IH T S} {N AA T} {DH AE T} {K L IH R}-{K AH T}.|p245 58 | wav22_silence_trimmed_wav/p246/p246_355_mic1.wav|{IH T} {IH Z} {AH} {M AH M AO R IY AH L}.|p246 59 | wav22_silence_trimmed_wav/p246/p246_357_mic1.wav|{HH IY} {IH Z} {V EH R IY} {K AH N S ER N D} {AH B AW T} {HH IH Z} {F IH T N AH S} {AE Z} {AH} {F UH T B AO L ER}.|p246 60 | wav22_silence_trimmed_wav/p246/p246_358_mic1.wav|{DH EH R} {T AY M} {HH AE Z} {G AO N}.|p246 61 | wav22_silence_trimmed_wav/p247/p247_473_mic1.wav|{W IY} {W IH L} {N AW} {L UH K} {IH N T UW} {IH T S} {HH IH S T AO R IH K AH L} {B AE K G R AW N D}.|p247 62 | wav22_silence_trimmed_wav/p247/p247_474_mic1.wav|{DH EY} {W ER} {IH M P R EH S IH V} {AH G EH N S T} {F R AE N S}.|p247 63 | wav22_silence_trimmed_wav/p247/p247_475_mic1.wav|{DH AH} {CH IY F} {K AA N S T AH B AH L} {HH AE Z} {R IH T AY R D}.|p247 64 | wav22_silence_trimmed_wav/p248/p248_371_mic1.wav|{W EH DH ER} {HH IH Z} {S T AE N S} {IH Z} {SH EH R D} {B AY} {DH AH} {IH N K AH M IH NG} {M AE N AH JH ER} {IH Z} {AH N AH DH ER} {M AE T ER}.|p248 65 | wav22_silence_trimmed_wav/p248/p248_372_mic1.wav|{K AH L ER} {W AA Z} {AE T} {DH AH} {K AO R} {AH V} {HH IH Z} {L AY F}.|p248 66 | wav22_silence_trimmed_wav/p248/p248_373_mic1.wav|{HH IY Z} {D IH L AY T AH D}, {T UW}, {W IH DH} {DH AH} {N UW} {P R EH M AH S AH Z}.|p248 67 | wav22_silence_trimmed_wav/p249/p249_349_mic1.wav|{DH AH} {D IH S IH ZH AH N} {IH Z} {AH N} {AE B S AH L UW T} {D IH S G R EY S}.|p249 68 | wav22_silence_trimmed_wav/p249/p249_350_mic1.wav|{IH T} {IH Z} {HH AO R AH B AH L}.|p249 69 | wav22_silence_trimmed_wav/p249/p249_351_mic1.wav|{IH T} {W AA Z} {V EH R IY} {F AO R M AH L}.|p249 70 | wav22_silence_trimmed_wav/p250/p250_491_mic1.wav|{AY} {T IY} {EY CH} {EY} {EH S} {B IH N} {DH AH} {Y IH R} {AH V} {DH AH} {Y AH NG S T ER} {AE T} {K IH L M AA R N AH K}.|p250 71 | wav22_silence_trimmed_wav/p250/p250_492_mic1.wav|{DH IY Z} {P IY P AH L} {AH T AE K} {DH AH} {K AO R} {AH V} {M AY} {B IH L IY F S}.|p250 72 | wav22_silence_trimmed_wav/p250/p250_493_mic1.wav|{DH AH} {S T R AY K ER} {N IY D AH D} {AH T EH N SH AH N} {B IH F AO R} {HH IY} {K UH D} {R IH Z UW M}.|p250 73 | wav22_silence_trimmed_wav/p251/p251_367_mic1.wav|{HH IY} {HH AE Z} {S IY N} {DH AH} {P AE S T}.|p251 74 | wav22_silence_trimmed_wav/p251/p251_368_mic1.wav|{W IY} {W IH L} {B IY} {HH OW M L AH S}.|p251 75 | wav22_silence_trimmed_wav/p251/p251_369_mic1.wav|{DH AE T} {W UH D} {HH EH L P}.|p251 76 | 
wav22_silence_trimmed_wav/p252/p252_406_mic1.wav|{D AH Z} {IH T} {M AE T ER}?|p252 77 | wav22_silence_trimmed_wav/p252/p252_407_mic1.wav|{DH AH} {D AA K T ER Z} {AA R} {K W AY T} {P AA Z AH T IH V} {AH B AW T} {M AY} {P R AA G R AH S}.|p252 78 | wav22_silence_trimmed_wav/p252/p252_408_mic1.wav|{DH EY} {HH AE D} {T UW} {HH AE V} {HH AA S P IH T AH L} {T R IY T M AH N T}.|p252 79 | wav22_silence_trimmed_wav/p253/p253_404_mic1.wav|{EH R Z} {TH AE CH ER}{AA Z} {IH N} {DH AH} {R AY T} {P L EY S}, {AE T} {DH AH} {R AY T} {T AY M}.|p253 80 | wav22_silence_trimmed_wav/p253/p253_405_mic1.wav|{AH N T IH L} {DH EY} {K EY M} {T UW} {D UW} {IH T}.|p253 81 | wav22_silence_trimmed_wav/p253/p253_407_mic1.wav|{DH AE T} {M EY} {B IY}.|p253 82 | wav22_silence_trimmed_wav/p254/p254_398_mic1.wav|{AH R IH K S AH N} {W UH D} {HH AE V} {AH P R UW V D}.|p254 83 | wav22_silence_trimmed_wav/p254/p254_399_mic1.wav|{P Y UW P AH L Z} {W ER} {AH L AW D} {HH OW M} {AE T} {L AH N CH T AY M}.|p254 84 | wav22_silence_trimmed_wav/p254/p254_400_mic1.wav|{IH T} {W AA Z} {AH} {B R EH TH T EY K IH NG} {M OW M AH N T}.|p254 85 | wav22_silence_trimmed_wav/p255/p255_376_mic1.wav|{K AH L ER} {W AA Z} {AE T} {DH AH} {K AO R} {AH V} {HH IH Z} {L AY F}.|p255 86 | wav22_silence_trimmed_wav/p255/p255_377_mic1.wav|{HH IY Z} {D IH L AY T AH D}, {T UW}, {W IH DH} {DH AH} {N UW} {P R EH M AH S AH Z}.|p255 87 | wav22_silence_trimmed_wav/p255/p255_378_mic1.wav|{IH T} {W AA Z} {AE Z} {IH F} {IH T} {W AA Z} {AO L} {HH AE P AH N IH NG} {AE T} {AH} {G AA R D AH N} {P AA R T IY}.|p255 88 | wav22_silence_trimmed_wav/p256/p256_315_mic1.wav|{DH EY} {W ER} {F AE N T AE S T IH K}.|p256 89 | wav22_silence_trimmed_wav/p256/p256_316_mic1.wav|{DH EH R} {IH Z} {N OW} {AH DH ER} {S AH L UW SH AH N} {T UW} {K AH N JH EH S CH AH N} {IH N} {EH D AH N B ER OW}.|p256 90 | wav22_silence_trimmed_wav/p256/p256_317_mic1.wav|{SH IY} {IH Z} {V EH R IY} {D IH S T IH NG K T IH V}.|p256 91 | wav22_silence_trimmed_wav/p257/p257_429_mic1.wav|{IH T S} {AH} {M IH R AH K AH L}.|p257 92 | wav22_silence_trimmed_wav/p257/p257_430_mic1.wav|{DH AH} {F ER S T} {M IH N AH S T ER} {IH Z} {AA B V IY AH S L IY} {K AH N S ER N D} {T UW} {HH IY R} {AH B AW T} {DH IH S} {IH N S AH D AH N T}.|p257 93 | wav22_silence_trimmed_wav/p257/p257_431_mic1.wav|{DH AE T} {W AA Z} {DH AH} {F ER S T} {T AY M} {AY} {W ER K T} {W IH DH} {R IH CH ER D}.|p257 94 | wav22_silence_trimmed_wav/p258/p258_409_mic1.wav|{DH AH} {V IH ZH W AH L} {AA R T S} {K AH M IH T IY} {T UH K} {DH IH S} {D IH S IH ZH AH N} {IH N} {D IH S EH M B ER}.|p258 95 | wav22_silence_trimmed_wav/p258/p258_410_mic1.wav|{AH DH ER} {M EH M B ER Z} {AH V} {DH AH} {F AE M AH L IY} {W ER} {T UW} {AH P S EH T} {T UW} {K AA M EH N T} {L AE S T} {N AY T}.|p258 96 | wav22_silence_trimmed_wav/p258/p258_411_mic1.wav|{AY} {B IH L IY V} {DH AH} {G AE P} {IH Z} {N IH R L IY} {DH EH R}.|p258 97 | wav22_silence_trimmed_wav/p259/p259_476_mic1.wav|{HH IY} {D IH Z ER V D} {AH} {R EH D} {K AA R D}.|p259 98 | wav22_silence_trimmed_wav/p259/p259_477_mic1.wav|{DH AH} {P R AY S} {P EY D} {F AO R} {IH T} {W AA Z} {N AA T} {D IH S K L OW Z D}.|p259 99 | wav22_silence_trimmed_wav/p259/p259_478_mic1.wav|{Y UW} {HH AE V} {T UW} {HH AE V} {AH} {P R UW V AH N} {T R AE K} {R EH K ER D} {T UW} {G EH T} {DH EH M}.|p259 100 | wav22_silence_trimmed_wav/p260/p260_352_mic1.wav|{DH EY} {SH UH D} {T EY K} {DH EH R} {M OW B AH L} {F OW N Z}.|p260 101 | wav22_silence_trimmed_wav/p260/p260_353_mic1.wav|{EY T} {M AH N TH S} {L EY T ER}, {HH IY} {W AA Z} {D EH D}.|p260 102 | 
wav22_silence_trimmed_wav/p260/p260_354_mic1.wav|{AY} {EY CH} {EY} {V IY} {IY} {AH} {D R IY M}.|p260 103 | wav22_silence_trimmed_wav/p261/p261_469_mic1.wav|{DH EY} {AA R} {D IH F AY N D} {B AY} {L AO}.|p261 104 | wav22_silence_trimmed_wav/p261/p261_470_mic1.wav|{IH F} {DH AE T S} {DH AH} {K EY S}, {HH IY} {W IH L} {S T R AH G AH L}.|p261 105 | wav22_silence_trimmed_wav/p261/p261_471_mic1.wav|{AY M} {V EH R IY} {P L IY Z D} {F AO R} {DH AH} {K L AH B} {AE N D} {F AO R} {M AY S EH L F}.|p261 106 | wav22_silence_trimmed_wav/p262/p262_389_mic1.wav|{IH T} {W AA Z} {AH} {B R EH TH T EY K IH NG} {M OW M AH N T}.|p262 107 | wav22_silence_trimmed_wav/p262/p262_390_mic1.wav|{IH T} {W AA Z} {AA N} {F AY ER}.|p262 108 | wav22_silence_trimmed_wav/p262/p262_391_mic1.wav|{DH IH S} {IH Z} {AH} {M EY JH ER} {S T EH P} {F AO R W ER D} {F AO R} {K EH R ER Z}.|p262 109 | wav22_silence_trimmed_wav/p263/p263_468_mic1.wav|{IH T S} {B IH N} {AH} {L AO NG}, {L AO NG} {JH ER N IY}.|p263 110 | wav22_silence_trimmed_wav/p263/p263_469_mic1.wav|{AY} {JH AH S T} {G AA T} {AH N D ER} {DH AH} {B AO L} {AH} {B IH T}.|p263 111 | wav22_silence_trimmed_wav/p263/p263_470_mic1.wav|{DH EH R} {IH Z} {AH} {S IH M AH L ER} {S T AO R IY} {F AO R} {M IH L K} {AE N D} {D EH R IY} {P R AA D AH K T S}.|p263 112 | wav22_silence_trimmed_wav/p264/p264_490_mic1.wav|{W IY V} {G AA T} {G UH D} {AA P SH AH N Z}.|p264 113 | wav22_silence_trimmed_wav/p264/p264_491_mic1.wav|{DH IH S} {AA P ER EY SH AH N} {W IH L} {CH EY N JH} {HH ER} {L AY F}.|p264 114 | wav22_silence_trimmed_wav/p264/p264_492_mic1.wav|{N UW} {Y AO R K} {IH Z} {M AY} {HH OW M}.|p264 115 | wav22_silence_trimmed_wav/p265/p265_347_mic1.wav|{DH AH} {F AY N AH L} {D IH S IH ZH AH N} {W AA Z} {B IH T W IY N} {S K AA T L AH N D} {AE N D} {DH AH} {R IY P AH B L AH K} {AH V} {AY ER L AH N D}.|p265 116 | wav22_silence_trimmed_wav/p265/p265_348_mic1.wav|{AH} {R IY V Y UW} {IH Z} {AH N D ER} {W EY}.|p265 117 | wav22_silence_trimmed_wav/p265/p265_349_mic1.wav|{S EY F T IY} {W AA Z} {AO L S OW} {AH N} {IH SH UW}.|p265 118 | wav22_silence_trimmed_wav/p266/p266_419_mic1.wav|{IH T} {IH Z} {L IY G AH L IY} {B AY N D IH NG}.|p266 119 | wav22_silence_trimmed_wav/p266/p266_420_mic1.wav|{P EH R IH S} {IH Z} {N AA T} {OW N L IY} {Y AH NG}.|p266 120 | wav22_silence_trimmed_wav/p266/p266_421_mic1.wav|{S AH CH} {AE Z} {IH T} {IH Z}.|p266 121 | wav22_silence_trimmed_wav/p267/p267_416_mic1.wav|{B AY} {DH EH N}, {AH} {M AE S IH V} {L IY G AH L} {B AE T AH L} {IH Z} {L AY K L IY} {T UW} {HH AE V} {S T AA R T AH D}.|p267 122 | wav22_silence_trimmed_wav/p267/p267_417_mic1.wav|{DH IH S} {IH Z} {N OW} {R AH F L EH K SH AH N} {AA N} {R EY N JH ER Z}.|p267 123 | wav22_silence_trimmed_wav/p267/p267_418_mic1.wav|{F ER DH ER} {N UW} {EH K W AH T IY} {W AA Z} {R EY Z D} {IH N} {AH} {P L EY S IH NG} {IH N} {JH AE N Y UW EH R IY}.|p267 124 | wav22_silence_trimmed_wav/p268/p268_406_mic1.wav|{W AA SH IH NG T AH N} {IH Z} {K AH N S UW M D} {B AY} {DH AH} {K R AY S AH S}.|p268 125 | wav22_silence_trimmed_wav/p268/p268_407_mic1.wav|{N AW}, {S AH D AH N L IY}, {W IY} {HH AE V} {DH IH S} {N UW} {L AE N D S K EY P}.|p268 126 | wav22_silence_trimmed_wav/p268/p268_408_mic1.wav|{AY} {SH UH D} {TH IH NG K} {S OW}, {T UW}.|p268 127 | wav22_silence_trimmed_wav/p269/p269_398_mic1.wav|{AH P AA R T} {F R AH M} {DH AH} {R IH Z AH L T} {W IY} {HH AE V} {T UW} {B IY} {HH AE P IY} {W IH DH} {AW ER} {P ER F AO R M AH N S}.|p269 128 | wav22_silence_trimmed_wav/p269/p269_399_mic1.wav|{W EH L}, {IH T} {D IH D} {L AE S T} {T AY M}, {HH IY} {W AA 
Z} {R IY M AY N D AH D}.|p269 129 | wav22_silence_trimmed_wav/p269/p269_400_mic1.wav|{B AY} {DH EH N}, {AH} {M AE S IH V} {L IY G AH L} {B AE T AH L} {IH Z} {L AY K L IY} {T UW} {HH AE V} {S T AA R T AH D}.|p269 130 | wav22_silence_trimmed_wav/p270/p270_457_mic1.wav|{W IY} {W IH L} {D IY L} {W IH DH} {DH AH} {R EH F Y UW JH IY Z}.|p270 131 | wav22_silence_trimmed_wav/p270/p270_458_mic1.wav|{IH F} {DH AE T S} {DH AH} {K EY S}, {HH IY} {W IH L} {S T R AH G AH L}.|p270 132 | wav22_silence_trimmed_wav/p270/p270_459_mic1.wav|{W IY} {AA R} {OW V ER} {DH AH} {M UW N}.|p270 133 | wav22_silence_trimmed_wav/p271/p271_449_mic1.wav|{IH T} {W IH L} {D AH T ER M AH N} {W EH DH ER} {AH N} {AH F EH N S} {HH AE Z} {AH K ER D}.|p271 134 | wav22_silence_trimmed_wav/p271/p271_450_mic1.wav|{W AA Z} {IH T} {W ER TH} {IH T}?|p271 135 | wav22_silence_trimmed_wav/p271/p271_451_mic1.wav|{IH T} {W AA Z} {IH G N AO R D}.|p271 136 | wav22_silence_trimmed_wav/p272/p272_407_mic1.wav|{HH AW EH V ER}, {DH AH} {P L EY ER Z} {SH UH D} {HH AE V} {AH} {V OY S} {IH N} {DH IY Z} {M AE T ER Z}.|p272 137 | wav22_silence_trimmed_wav/p272/p272_408_mic1.wav|{HH IY} {HH AE Z} {R IH T AH N} {T UW} {DH AH} {M IH N AH S T ER} {AE F T ER} {M IY T IH NG Z} {AA N} {DH AH} {AY L AH N D}.|p272 138 | wav22_silence_trimmed_wav/p272/p272_409_mic1.wav|{B AH T} {AY} {F EH L T} {IH T} {W AA Z} {IH M P AO R T AH N T} {T UW} {IH N T R AH D UW S} {DH AH} {EH L AH M AH N T} {AH V} {T R AH D IH SH AH N}.|p272 139 | wav22_silence_trimmed_wav/p273/p273_431_mic1.wav|{DH EY} {AA R} {N AA T} {L EH F T} {W IH NG}.|p273 140 | wav22_silence_trimmed_wav/p273/p273_432_mic1.wav|{IH T} {IH Z} {N AA T} {AH} {S T AE N D IH NG} {AA R M IY}.|p273 141 | wav22_silence_trimmed_wav/p273/p273_433_mic1.wav|{S T R EY N JH L IY} {IH N AH F} {AY} {F EH L T} {V EH R IY} {SH AA R P}.|p273 142 | wav22_silence_trimmed_wav/p274/p274_465_mic1.wav|{DH AH} {EH D AH N B ER OW} {AA D IY AH N S} {W AA Z} {EY B AH L} {T UW} {AH N D ER S T AE N D} {DH AH} {D AY AH L AO G}.|p274 143 | wav22_silence_trimmed_wav/p274/p274_466_mic1.wav|{F IH L} {M IH K EH L S AH N} {D IH D} {DH AE T} {L AE S T} {Y IH R}.|p274 144 | wav22_silence_trimmed_wav/p274/p274_467_mic1.wav|{DH AH} {AE N S ER} {W AA Z} {AH P} {DH EH R} {AA N} {DH AH} {S T EY JH}.|p274 145 | wav22_silence_trimmed_wav/p275/p275_424_mic1.wav|{W IY} {AA R} {N AW} {AH P} {AH G EH N S T} {IH T}.|p275 146 | wav22_silence_trimmed_wav/p275/p275_425_mic1.wav|{L AE S T} {Y IH R}, {IH T} {W AA Z} {W AH N} {B AY} {JH AE K} {M AH K AA N AH L}, {DH AH} {F ER S T} {M IH N AH S T ER}.|p275 147 | wav22_silence_trimmed_wav/p275/p275_426_mic1.wav|{AY} {W AA N T} {AH} {K ER IH R}.|p275 148 | wav22_silence_trimmed_wav/p276/p276_460_mic1.wav|{W IY} {W IH L} {P EY} {DH EH R} {B IH L Z}.|p276 149 | wav22_silence_trimmed_wav/p276/p276_461_mic1.wav|{DH IH S} {K AO R T} {HH AE Z} {M EY D} {AH N} {AO R D ER} {W IH CH} {HH AE Z} {N AA T} {B IH N} {AH B Z ER V D}.|p276 150 | wav22_silence_trimmed_wav/p276/p276_462_mic1.wav|{IH T} {W AA Z} {V EH R IY} {P AA Z AH T IH V} {AE N D} {V EH R IY} {HH EH L P F AH L}.|p276 151 | wav22_silence_trimmed_wav/p277/p277_458_mic1.wav|{S K AA T IH SH} {P AH B L IH K} {F IH N AE N S IH Z} {IH M ER JH} {F R AH M} {DH IH S} {R IY V Y UW} {EH N HH AE N S T}.|p277 152 | wav22_silence_trimmed_wav/p277/p277_459_mic1.wav|{SH IY} {AO L S OW} {D IH F EH N D AH D} {DH AH} {L AO R D} {CH AE N S AH L ER Z} {IH G Z IH S T IH NG} {P AW ER Z}.|p277 153 | wav22_silence_trimmed_wav/p277/p277_460_mic1.wav|{AY D} {L AH V} {T UW} {B IY} {L AY K} {P IY T 
ER}.|p277 154 | wav22_silence_trimmed_wav/p278/p278_405_mic1.wav|{W IY} {M AH S T} {IH M P R UW V} {AW ER} {R IY L EY SH AH N Z} {W IH DH} {G AH V ER N M AH N T}.|p278 155 | wav22_silence_trimmed_wav/p278/p278_406_mic1.wav|{DH AH} {K AE B AH N AH T} {IH Z} {S P L IH T} {OW V ER} {DH AH} {IH SH UW}.|p278 156 | wav22_silence_trimmed_wav/p278/p278_407_mic1.wav|{M EH S} {M AH N EY L} {W AA Z} {K IH L D} {AA N} {IH M P AE K T}.|p278 157 | wav22_silence_trimmed_wav/p279/p279_401_mic1.wav|{N OW B AA D IY} {IY V IH N} {N UW} {IH T} {HH AE D} {HH AE P AH N D}.|p279 158 | wav22_silence_trimmed_wav/p279/p279_402_mic1.wav|{W AH T} {W UH D} {B IY} {DH AH} {P OY N T}?|p279 159 | wav22_silence_trimmed_wav/p279/p279_403_mic1.wav|{DH AH} {W IY K L IY} {AE V ER IH JH} {W AA Z} {TH R IY} {AW ER Z}.|p279 160 | wav22_silence_trimmed_wav/p281/p281_455_mic1.wav|{SH IY} {HH AE Z} {N AW} {B IH N} {R EH JH IH S T ER D} {AE Z} {D IH S EY B AH L D}.|p281 161 | wav22_silence_trimmed_wav/p281/p281_456_mic1.wav|{W IY} {K UH D} {HH AA R D L IY} {B IH L IY V} {IH T}.|p281 162 | wav22_silence_trimmed_wav/p281/p281_457_mic1.wav|{DH AH} {M EH N} {AA R} {V EH R IY} {W ER IY D}.|p281 163 | wav22_silence_trimmed_wav/p282/p282_365_mic1.wav|{B AY} {DH AE T} {T AY M}, {HH AW EH V ER}, {IH T} {W AA Z} {AO L R EH D IY} {T UW} {L EY T}.|p282 164 | wav22_silence_trimmed_wav/p282/p282_366_mic1.wav|{IH T} {W AA Z} {AH} {F AH N IY} {G EY M}.|p282 165 | wav22_silence_trimmed_wav/p282/p282_367_mic1.wav|{DH EH R} {W AA Z AH N T} {AH} {G OW L}.|p282 166 | wav22_silence_trimmed_wav/p283/p283_466_mic1.wav|{DH AH} {EH D AH N B ER OW} {AA D IY AH N S} {W AA Z} {EY B AH L} {T UW} {AH N D ER S T AE N D} {DH AH} {D AY AH L AO G}.|p283 167 | wav22_silence_trimmed_wav/p283/p283_467_mic1.wav|{IH T} {AO L} {B IH G AE N} {AE Z} {AH N} {AE K S AH D AH N T}.|p283 168 | wav22_silence_trimmed_wav/p283/p283_468_mic1.wav|{W IY} {AA R} {G OW IH NG} {TH R UW} {DH AH} {P R AA S EH S}.|p283 169 | wav22_silence_trimmed_wav/p284/p284_420_mic1.wav|{HH IY} {W AA Z} {K AH N V IH K T AH D} {AE T} {G L AE S K OW} {SH EH R AH F} {K AO R T}.|p284 170 | wav22_silence_trimmed_wav/p284/p284_421_mic1.wav|{AH R IH K S AH N} {W UH D} {HH AE V} {AH P R UW V D}.|p284 171 | wav22_silence_trimmed_wav/p284/p284_422_mic1.wav|{AY V} {IH N V EH N T AH D} {AH} {V IH L AH JH} {IH N} {IY S T} {L AA TH IY AH N}.|p284 172 | wav22_silence_trimmed_wav/p285/p285_398_mic1.wav|{TH AE NG K F AH L IY}, {N OW}-{W AH N} {AA N} {DH AH} {B UH S} {IH Z} {T UW} {B AE D L IY} {HH ER T}.|p285 173 | wav22_silence_trimmed_wav/p285/p285_399_mic1.wav|{IH Z} {DH AE T} {P AA S AH B AH L}?|p285 174 | wav22_silence_trimmed_wav/p285/p285_400_mic1.wav|{B AH T}, {IH N} {F AE K T}, {DH AH} {R IH V ER S} {IH Z} {T R UW}.|p285 175 | wav22_silence_trimmed_wav/p286/p286_463_mic1.wav|{W IY} {AA R} {N AW} {AH P} {AH G EH N S T} {IH T}.|p286 176 | wav22_silence_trimmed_wav/p286/p286_464_mic1.wav|{F EY L Y ER} {T UW} {R IY AE K T} {T UW} {IY CH} {W AH N} {K UH D} {M IY N} {AH} {D IH Z AE S T ER}.|p286 177 | wav22_silence_trimmed_wav/p286/p286_465_mic1.wav|{IH T} {IH Z} {V AY T AH L} {T UW} {K L EH R AH F AY} {DH EH R} {R OW L}.|p286 178 | wav22_silence_trimmed_wav/p287/p287_419_mic1.wav|{IH T} {W AA Z} {IH M P AO R T AH N T} {T UW} {W IH N} {DH AH} {S IH NG G AH L Z}.|p287 179 | wav22_silence_trimmed_wav/p287/p287_420_mic1.wav|{AY M} {P L IY Z D} {AH B AW T} {W AH N} {TH IH NG}.|p287 180 | wav22_silence_trimmed_wav/p287/p287_421_mic1.wav|{DH AE T S} {B IH N} {AO L} {ER AW N D} {Y UH R AH P} {W IH DH} {M IY}.|p287 181 | 
wav22_silence_trimmed_wav/p288/p288_407_mic1.wav|{DH AH} {R IY L} {EH N AH M IY} {IH Z} {IH N} {Y AO R} {OW N} {B AE K} {Y AA R D}.|p288 182 | wav22_silence_trimmed_wav/p288/p288_408_mic1.wav|{DH AH} {V IH ZH W AH L} {AA R T S} {K AH M IH T IY} {T UH K} {DH IH S} {D IH S IH ZH AH N} {IH N} {D IH S EH M B ER}.|p288 183 | wav22_silence_trimmed_wav/p288/p288_409_mic1.wav|{W IY} {N IY D} {DH AH} {CH IY F} {M EH D AH K AH L} {AO F AH S ER} {T UW} {K L EH R AH F AY} {DH AH} {M AE T ER}.|p288 184 | wav22_silence_trimmed_wav/p292/p292_419_mic1.wav|{B AH T} {AH} {S ER P R AY Z} {IH Z} {IH N} {S T AO R}.|p292 185 | wav22_silence_trimmed_wav/p292/p292_420_mic1.wav|{Y UW} {K AE N} {K OW P} {W IH DH} {IH T}.|p292 186 | wav22_silence_trimmed_wav/p292/p292_421_mic1.wav|{HH IY} {W AA Z} {B AE K} {T UW} {S K W EH R} {W AH N}.|p292 187 | wav22_silence_trimmed_wav/p293/p293_395_mic1.wav|{DH IH S} {IH Z} {AH} {K AH M P L IY T L IY} {N UW} {IH K S P IH R IY AH N S} {F AO R} {M IY}.|p293 188 | wav22_silence_trimmed_wav/p293/p293_396_mic1.wav|{AY} {SH UH D} {N EH V ER} {HH AE V} {K AH M} {OW V ER} {T UW} {F AY F}.|p293 189 | wav22_silence_trimmed_wav/p293/p293_397_mic1.wav|{W AH T} {K AY N D} {AH V} {P ER S AH N} {IH Z} {HH IY}?|p293 190 | wav22_silence_trimmed_wav/p294/p294_419_mic1.wav|{W IY} {HH AE V} {EH V ER IY} {R IH S P EH K T} {F AO R} {M IH SH EH L}.|p294 191 | wav22_silence_trimmed_wav/p294/p294_420_mic1.wav|{M R} {B L AH NG K AH T} {W AA Z} {K AH N V IH N S T}.|p294 192 | wav22_silence_trimmed_wav/p294/p294_421_mic1.wav|{DH AH} {AA K SH AH N} {W IH L} {B IY} {HH EH L D} {T AH M AA R OW}.|p294 193 | wav22_silence_trimmed_wav/p295/p295_419_mic1.wav|{EH V R IY TH IH NG} {IH Z} {S ER AW N D AH D} {B AY} {K AH N F Y UW ZH AH N}.|p295 194 | wav22_silence_trimmed_wav/p295/p295_420_mic1.wav|{B AH T}, {AH V} {K AO R S}, {IH T} {IH Z AH N T}.|p295 195 | wav22_silence_trimmed_wav/p295/p295_421_mic1.wav|{W IY} {AA R} {AO L M OW S T} {AE T} {DH AE T} {P OY N T}, {B AH T} {N AA T} {K W AY T}.|p295 196 | wav22_silence_trimmed_wav/p297/p297_419_mic1.wav|{HH AA L IY W UH D} {K AA M AH D IY} {IH Z} {HH AE V IH NG} {AH} {G UH D} {W IY K}.|p297 197 | wav22_silence_trimmed_wav/p297/p297_420_mic1.wav|{Y EH T} {DH AH} {K AH N S EH N S AH S} {IH Z}, {IH T S} {W ER TH} {IH T}.|p297 198 | wav22_silence_trimmed_wav/p297/p297_421_mic1.wav|{IH T} {EH S} {DH AH} {T IH P} {AH V} {DH AH} {AY S B ER G}.|p297 199 | wav22_silence_trimmed_wav/p298/p298_400_mic1.wav|{DH AH} {S AW TH} {AE F R IH K AA N} {D IH F EH N D AH D} {HH IH Z} {W ER K} {EH TH AH K}.|p298 200 | wav22_silence_trimmed_wav/p298/p298_401_mic1.wav|{DH AH} {OW L IH M P IH K} {K AH M IH T IY} {SH UH D} {B IY} {AH SH EY M D} {AH V} {DH EH M S EH L V Z}.|p298 201 | wav22_silence_trimmed_wav/p298/p298_402_mic1.wav|{Y UW} {AA R} {G OW IH NG} {T UW} {G L AE S K OW} {EH R P AO R T}, {AE N D} {N AA T} {K AH M IH NG} {B AE K}.|p298 202 | wav22_silence_trimmed_wav/p299/p299_400_mic1.wav|{W IY} {AA R} {W ER K IH NG} {AA N} {DH AH} {T UW} {TH IH NG Z}.|p299 203 | wav22_silence_trimmed_wav/p299/p299_401_mic1.wav|{W IY} {OW N L IY} {L AO S T} {DH AH} {TH ER D} {G OW L}.|p299 204 | wav22_silence_trimmed_wav/p299/p299_402_mic1.wav|{W IH M AH N} {R IY P AO R T AH D} {M AO R} {D IH P R EH SH AH N} {DH AE N} {M EH N}.|p299 205 | wav22_silence_trimmed_wav/p300/p300_395_mic1.wav|{V IH K T AH R IY}, {SH IY} {IH N S IH S T AH D} {Y EH S T ER D EY}, {W AA Z} {F AA R} {M AO R} {IH M P AO R T AH N T} {DH AE N} {DH AH} {R IH W AO R D Z}.|p300 206 | 
wav22_silence_trimmed_wav/p300/p300_396_mic1.wav|{F AO R} {AO L} {HH IH Z} {S AH K S EH S AH Z}, {HH IY} {IH Z} {AH K AH S T AH M D} {T UW} {W EY T IH NG} {F AO R} {F UH L F IH L M AH N T}.|p300 207 | wav22_silence_trimmed_wav/p300/p300_397_mic1.wav|{AH} {F EY T AH L} {AE K S AH D AH N T} {IH N K W AY R IY} {IH Z} {AO NG G OW IH NG}.|p300 208 | wav22_silence_trimmed_wav/p301/p301_406_mic1.wav|{JH AE K S AH N} {M EY} {W EH L} {B IY} {R AY T}.|p301 209 | wav22_silence_trimmed_wav/p301/p301_407_mic1.wav|{M AY} {TH AO T S} {AA R} {W IH DH} {DH EH R} {F AE M AH L IY Z}.|p301 210 | wav22_silence_trimmed_wav/p301/p301_408_mic1.wav|{AY} {S AO} {S AH M} {G UH D} {TH IH NG Z}.|p301 211 | wav22_silence_trimmed_wav/p302/p302_311_mic1.wav|{IH T} {P L AE N Z} {T UW} {R IH T ER N} {T UW} {DH IH S} {F IY L D}.|p302 212 | wav22_silence_trimmed_wav/p302/p302_312_mic1.wav|{HH AW EH V ER}, {DH EH R} {W AA Z} {N OW} {HH OW P}, {AE N D} {G L AO R IY} {T UW}, {F AO R} {S K AA T L AH N D}.|p302 213 | wav22_silence_trimmed_wav/p302/p302_313_mic1.wav|{T UW} {AO L} {IH N T EH N T S} {AE N D} {P ER P AH S AH Z}, {HH IY} {R AE N} {DH AH} {SH OW}.|p302 214 | wav22_silence_trimmed_wav/p303/p303_348_mic1.wav|{AY} {K AE N} {AH P IH R} {T UW} {B IY} {N AY S} {AE N D} {L AH V L IY}.|p303 215 | wav22_silence_trimmed_wav/p303/p303_349_mic1.wav|{AY} {D OW N T} {TH IH NG K} {DH AH} {S AO D AH S} {W IH L} {L EY} {D AW N}.|p303 216 | wav22_silence_trimmed_wav/p303/p303_350_mic1.wav|{P IY P AH L} {T EH N D AH D} {T UW} {S T EY} {DH EH R} {F AO R} {S AH M} {T AY M}.|p303 217 | wav22_silence_trimmed_wav/p304/p304_419_mic1.wav|{AY} {K AE N} {AH N D ER S T AE N D} {W AY} {DH EY} {HH AE V} {G AO N}.|p304 218 | wav22_silence_trimmed_wav/p304/p304_420_mic1.wav|{DH AE T}, {DH OW}, {W AA Z} {DH EH N}.|p304 219 | wav22_silence_trimmed_wav/p304/p304_421_mic1.wav|{P IY P AH L} {SH UH D} {N AA T} {B IH K AH M} {K AH M P L EY S AH N T}.|p304 220 | wav22_silence_trimmed_wav/p305/p305_417_mic1.wav|{W IY V} {S T IH L} {G AA T} {AH} {S EY} {IH N} {DH IH S}.|p305 221 | wav22_silence_trimmed_wav/p305/p305_418_mic1.wav|{SH IY} {K AE N T} {S EY} {W EH R} {SH IY} {W AA Z}, {W AH T} {SH IY} {D IH D}.|p305 222 | wav22_silence_trimmed_wav/p305/p305_419_mic1.wav|{AY L} {T EY K} {W AH T} {K AH M Z} {AH L AO NG}.|p305 223 | wav22_silence_trimmed_wav/p306/p306_355_mic1.wav|{AY} {D IH D AH N T} {IY V IH N} {HH AE V} {DH AH} {F ER S T} {AY D IY AH}.|p306 224 | wav22_silence_trimmed_wav/p306/p306_356_mic1.wav|{N OW} {P ER S AH N} {W AA Z} {CH AA R JH D}.|p306 225 | wav22_silence_trimmed_wav/p306/p306_357_mic1.wav|{AY V} {OW N L IY} {M EH T} {HH IH M} {TH R IY} {T AY M Z}.|p306 226 | wav22_silence_trimmed_wav/p307/p307_419_mic1.wav|{IH T S} {AH} {P AA L AH S IY} {W IH CH} {HH AE Z} {W ER K T} {F AO R} {AH S}.|p307 227 | wav22_silence_trimmed_wav/p307/p307_420_mic1.wav|{L OW K AH L IY}, {T UW}, {DH AH} {M}{EH M} {P IY} {IH Z} {AH N D ER} {F AY ER}.|p307 228 | wav22_silence_trimmed_wav/p307/p307_421_mic1.wav|{HH IH Z} {V Y UW Z} {AA R} {HH AA R D L IY} {S ER P R AY Z IH NG}.|p307 229 | wav22_silence_trimmed_wav/p308/p308_418_mic1.wav|{AO R} {S OW} {SH IY} {TH AO T}.|p308 230 | wav22_silence_trimmed_wav/p308/p308_419_mic1.wav|{IH T} {W AA Z} {AH N AH DH ER} {G UH D} {AY D IY AH}.|p308 231 | wav22_silence_trimmed_wav/p308/p308_420_mic1.wav|{DH IH S} {W UH D} {D IH S K ER IH JH} {IH N V EH S T M AH N T} {AE N D} {JH AA B} {K R IY EY SH AH N}.|p308 232 | wav22_silence_trimmed_wav/p310/p310_419_mic1.wav|{IH T} {W UH D} {SH OW} {DH EH M} {AH} {P AA Z AH T IH V} {W EY} {F 
AO R W ER D}.|p310 233 | wav22_silence_trimmed_wav/p310/p310_420_mic1.wav|{AY} {TH IH NG K} {IH T S} {AH} {D IH S G R EY S}.|p310 234 | wav22_silence_trimmed_wav/p310/p310_421_mic1.wav|{S OW} {HH IY} {SH UH D} {B IY}.|p310 235 | wav22_silence_trimmed_wav/p311/p311_418_mic1.wav|{AY} {AH G R IY} {W IH DH} {DH EH M}.|p311 236 | wav22_silence_trimmed_wav/p311/p311_419_mic1.wav|{AY} {F IY L} {V EH R IY} {S T R AO NG L IY} {AH B AW T} {DH AE T}.|p311 237 | wav22_silence_trimmed_wav/p311/p311_420_mic1.wav|{W IY} {W IH L} {L IH S AH N} {T UW} {K AA L IY G Z}.|p311 238 | wav22_silence_trimmed_wav/p312/p312_414_mic1.wav|{DH AH} {P R AH P OW Z AH L Z} {HH AE V} {B IH N} {T EY K AH N} {AW T} {AH V} {K AA N T EH K S T}.|p312 239 | wav22_silence_trimmed_wav/p312/p312_415_mic1.wav|{G AA L F} {HH AE Z} {L AO S T} {AH} {G R EY T} {K EH R IH K T ER}.|p312 240 | wav22_silence_trimmed_wav/p312/p312_416_mic1.wav|{HH AW EH V ER}, {IH T} {HH AE Z} {HH AE D} {OW N L IY} {AH} {L IH M AH T AH D} {AH P T EY K}.|p312 241 | wav22_silence_trimmed_wav/p313/p313_418_mic1.wav|{W UH D} {DH EY} {W ER K} {T AH G EH DH ER} {AH G EH N}?|p313 242 | wav22_silence_trimmed_wav/p313/p313_419_mic1.wav|{AY} {D IH D AH N T} {P L EY} {AO L} {DH AE T} {W EH L}.|p313 243 | wav22_silence_trimmed_wav/p313/p313_420_mic1.wav|{HH UW} {K AE N} {T EH L}?|p313 244 | wav22_silence_trimmed_wav/p314/p314_418_mic1.wav|{AE Z} {F AO R} {OW AH N}, {K L AE S} {IH Z} {P ER M AA N EH N T}.|p314 245 | wav22_silence_trimmed_wav/p314/p314_419_mic1.wav|{T IY} {EY CH} {IY} {P EY N} {W AA Z} {AO L M OW S T} {T UW} {M AH CH} {T UW} {B EH R}.|p314 246 | wav22_silence_trimmed_wav/p314/p314_420_mic1.wav|{M AY} {TH AO T S} {AA R} {W IH DH} {DH EH R} {F AE M AH L IY Z}.|p314 247 | wav22_silence_trimmed_wav/p316/p316_418_mic1.wav|{DH IH S} {IH Z} {DH AH} {G R EY T AH S T} {M OW M AH N T} {AH V} {AW ER} {L AY V Z}.|p316 248 | wav22_silence_trimmed_wav/p316/p316_419_mic1.wav|{AA D IY AH N S AH Z} {DH EH R} {W ER} {SH AA K T}.|p316 249 | wav22_silence_trimmed_wav/p316/p316_420_mic1.wav|{DH EY} {R IH S P AA N D IH D} {IH N} {DH AH} {M OW S T} {P AA Z AH T IH V} {W EY}.|p316 250 | wav22_silence_trimmed_wav/p317/p317_418_mic1.wav|{SH IY} {W AA Z} {IH K S T R IY M L IY} {R IH JH AH D}.|p317 251 | wav22_silence_trimmed_wav/p317/p317_419_mic1.wav|{N IY DH ER} {S AY D} {K AE N} {W IH N} {DH IH S} {W AO R}.|p317 252 | wav22_silence_trimmed_wav/p317/p317_420_mic1.wav|{B AA R OW IH NG} {P AW ER} {W AA Z} {W AH N} {AH V} {DH OW Z} {IH M P R UW V M AH N T S}.|p317 253 | wav22_silence_trimmed_wav/p318/p318_419_mic1.wav|{AY} {D IH D AH N T} {P L EY} {AO L} {DH AE T} {W EH L}.|p318 254 | wav22_silence_trimmed_wav/p318/p318_420_mic1.wav|{HH UW} {K AE N} {T EH L}?|p318 255 | wav22_silence_trimmed_wav/p318/p318_421_mic1.wav|{IH T} {AH P IH R Z} {T UW} {HH AE V} {B IH N} {K L OW Z D} {D AW N}.|p318 256 | wav22_silence_trimmed_wav/p323/p323_418_mic1.wav|{W IY} {T EH N D} {T UW} {K AH M} {G UH D} {AE T} {DH AH} {EH N D}.|p323 257 | wav22_silence_trimmed_wav/p323/p323_419_mic1.wav|{AY} {HH AE V AH N T} {EH N JH OY D} {DH AH} {L AE S T} {K AH P AH L} {AH V} {Y IH R Z}.|p323 258 | wav22_silence_trimmed_wav/p323/p323_420_mic1.wav|{DH AH} {K AH M IH SH AH N} {IH Z} {N AA T} {DH AH} {OW N L IY} {L UW Z ER}.|p323 259 | wav22_silence_trimmed_wav/p326/p326_395_mic1.wav|{IH T} {W AA Z} {L AY K} {DH AH} {OW L D} {D EY Z}.|p326 260 | wav22_silence_trimmed_wav/p326/p326_396_mic1.wav|{AH} {G R EY T} {D IY L} {HH AE Z} {B IH N} {AH CH IY V D}.|p326 261 | wav22_silence_trimmed_wav/p326/p326_397_mic1.wav|{M 
AY} {F Y UW CH ER} {IH Z} {IH N} {DH AH} {M EH R AH TH AA N}.|p326 262 | wav22_silence_trimmed_wav/p329/p329_419_mic1.wav|{W IY V} {K AH M} {F R AH M} {AH} {L AO NG} {W EY} {B AE K}.|p329 263 | wav22_silence_trimmed_wav/p329/p329_420_mic1.wav|{AY V} {B IH N} {AE S K T} {DH AE T} {AO L R EH D IY}, {DH IH S} {M AO R N IH NG} - {AA N} {DH AH} {R EY D IY OW}.|p329 264 | wav22_silence_trimmed_wav/p329/p329_421_mic1.wav|{AY L} {N EH V ER} {TH IH NG K} {AH V} {M AY S EH L F} {AE Z} {AH} {S T AA R}.|p329 265 | wav22_silence_trimmed_wav/p330/p330_418_mic1.wav|{N AW}, {HH AW EH V ER}, {IH T} {HH AE Z} {AH N D ER G AO N} {AH} {D R AH M AE T IH K} {D IH K L AY N}.|p330 266 | wav22_silence_trimmed_wav/p330/p330_419_mic1.wav|{DH IH S} {IH Z} {AH B AW T} {DH AH} {AO R AH N JH} {AO R D ER}.|p330 267 | wav22_silence_trimmed_wav/p330/p330_420_mic1.wav|{B AH T} {IH T} {L UH K S} {G UH D} {F AO R} {N EH K S T} {Y IH R}.|p330 268 | wav22_silence_trimmed_wav/p333/p333_419_mic1.wav|{DH EY} {W ER} {IH N} {G UH D} {HH Y UW M ER}, {T UW}.|p333 269 | wav22_silence_trimmed_wav/p333/p333_420_mic1.wav|{IH T} {Y UW Z D} {T UW} {B AA DH ER} {M IY} {S AH M T AY M Z}, {B AH T} {IH T} {D AH Z AH N T} {EH N IY} {M AO R}.|p333 270 | wav22_silence_trimmed_wav/p333/p333_421_mic1.wav|{TH R IY} {M AH N TH S} {AO R} {S OW} {L EY T ER}, {DH EY} {AA R} {R IH T ER N IH NG}.|p333 271 | wav22_silence_trimmed_wav/p334/p334_419_mic1.wav|{DH IH S} {IH Z} {AH} {TH ER OW L IY} {HH AE N S AH M} {AE N D} {EH N JH OY AH B AH L} {P R AH D AH K SH AH N}.|p334 272 | wav22_silence_trimmed_wav/p334/p334_420_mic1.wav|{N OW} {EH V AH D AH N S} {DH AE T} {IH T} {W AA Z} {OW S AH M AH} {B IH N} {L EY D AH N}.|p334 273 | wav22_silence_trimmed_wav/p334/p334_421_mic1.wav|{AY} {TH IH NG K} {IH T} {IH Z} {T OW T AH L IY} {R AO NG}.|p334 274 | wav22_silence_trimmed_wav/p335/p335_418_mic1.wav|{L AO R D} {S EY N S B EH R IY} {IH Z} {N AA T} {AH} {N UW K AH M ER}.|p335 275 | wav22_silence_trimmed_wav/p335/p335_419_mic1.wav|{HH IY} {IH Z} {N AA T} {V EH R IY} {B IH G}, {IY DH ER}.|p335 276 | wav22_silence_trimmed_wav/p335/p335_420_mic1.wav|{S T IH L}, {DH AH} {IH K S P IH R IY AH N S} {W AA Z} {AH M EY Z IH NG}, {SH IY} {S EH Z}.|p335 277 | wav22_silence_trimmed_wav/p336/p336_419_mic1.wav|{W UH D} {HH IY} {EH V ER} {G OW} {B AE K} {T UW} {DH AH} {B IH G IH N IH NG}?|p336 278 | wav22_silence_trimmed_wav/p336/p336_420_mic1.wav|{W AH T} {IH Z} {G OW IH NG} {AA N}?|p336 279 | wav22_silence_trimmed_wav/p336/p336_421_mic1.wav|{L EH T} {DH AE T} {B IY} {DH EH R} {M AH M AO R IY AH L}.|p336 280 | wav22_silence_trimmed_wav/p339/p339_419_mic1.wav|{W IY} {SH UH D} {HH AE V} {B IH N} {R IH W AO R D IH D} {F AO R} {P R AH D UW S IH NG}.|p339 281 | wav22_silence_trimmed_wav/p339/p339_420_mic1.wav|{AY} {AE M} {AO L S OW} {D IH L AY T AH D} {F AO R} {AO L} {DH AH} {P L EY ER Z}.|p339 282 | wav22_silence_trimmed_wav/p339/p339_421_mic1.wav|{HH IY} {D IH D} {V EH R IY} {W EH L}, {B AH T} {D IH D AH N T} {W IH N}.|p339 283 | wav22_silence_trimmed_wav/p340/p340_419_mic1.wav|{AY L} {N EH V ER} {TH IH NG K} {AH V} {M AY S EH L F} {AE Z} {AH} {S T AA R}.|p340 284 | wav22_silence_trimmed_wav/p340/p340_420_mic1.wav|{IH T} {W AA Z} {W AH N} {N UW} {Y IH R}.|p340 285 | wav22_silence_trimmed_wav/p340/p340_421_mic1.wav|{W IY} {M EY D} {AH} {W ER L D} {R EH K ER D} {T AH G EH DH ER}.|p340 286 | wav22_silence_trimmed_wav/p341/p341_405_mic1.wav|{Y UW} {HH AE V} {T UW} {HH AE V} {S AH M} {HH OW P} {AE N D} {F EY TH}.|p341 287 | wav22_silence_trimmed_wav/p341/p341_406_mic1.wav|{DH EY} {HH OW L D} 
{AA N} {F AO R} {M EH N IY} {Y IH R Z}.|p341 288 | wav22_silence_trimmed_wav/p341/p341_407_mic1.wav|{HH AW EH V ER}, {DH AH} {R IY P AO R T} {W AA Z} {IH N K ER EH K T}.|p341 289 | wav22_silence_trimmed_wav/p343/p343_395_mic1.wav|{IY CH} {W AH N} {IH Z} {AE Z} {G UH D} {AE Z} {DH AH} {AH DH ER}.|p343 290 | wav22_silence_trimmed_wav/p343/p343_396_mic1.wav|{S AH B} {N AA T} {Y UW Z D}, {M AA R TH AH R}.|p343 291 | wav22_silence_trimmed_wav/p343/p343_397_mic1.wav|{AY} {HH AE V} {HH AA R D L IY} {EH N IY} {K AH M IH T M AH N T S}.|p343 292 | wav22_silence_trimmed_wav/p345/p345_395_mic1.wav|{AY} {W OW N T} {G OW} {B AE K} {T UW} {HH AA L AH N D}.|p345 293 | wav22_silence_trimmed_wav/p345/p345_396_mic1.wav|{DH EH R} {IH Z} {S T IH L} {AH} {L AO NG} {W EY} {T UW} {G OW}.|p345 294 | wav22_silence_trimmed_wav/p345/p345_397_mic1.wav|{AO L M OW S T} {OW V ER N AY T}, {HH IY} {F AW N D} {HH IH Z} {V OY S}.|p345 295 | wav22_silence_trimmed_wav/p347/p347_419_mic1.wav|{DH EY} {W AO N T IH D} {T UW} {SH OW} {W AH T} {DH EY} {K UH D} {D UW}.|p347 296 | wav22_silence_trimmed_wav/p347/p347_420_mic1.wav|{AY} {JH AH S T} {W AA N T} {T UW} {G EH T} {R IH D} {AH V} {IH T}.|p347 297 | wav22_silence_trimmed_wav/p347/p347_421_mic1.wav|{AH F IH SH AH L Z} {S EY} {DH AH} {S IH T IY} {M AH S T} {AH CH IY V} {DH IH S}.|p347 298 | wav22_silence_trimmed_wav/p351/p351_419_mic1.wav|{W AH T S} {IH T} {F AO R}, {AE F T ER} {AO L}?|p351 299 | wav22_silence_trimmed_wav/p351/p351_420_mic1.wav|{W IY} {W IH L} {B IY} {P L IY Z D} {T UW} {T AO K} {T UW} {DH EH M}.|p351 300 | wav22_silence_trimmed_wav/p351/p351_421_mic1.wav|{AH} {N AH M B ER} {AH V} {P IY P AH L} {K UH D} {N AA T} {S P IY K} {IH NG G L IH SH}.|p351 301 | wav22_silence_trimmed_wav/p360/p360_419_mic1.wav|{AY} {D OW N T} {TH IH NG K} {DH EY} {AA R} {T UW} {S IH R IY AH S}.|p360 302 | wav22_silence_trimmed_wav/p360/p360_420_mic1.wav|{T AY G ER} {IH Z} {DH AH} {IH K S EH P SH AH N}.|p360 303 | wav22_silence_trimmed_wav/p360/p360_421_mic1.wav|{DH EH R} {W AA Z} {N OW} {HH AE P IY} {EH N D IH NG}.|p360 304 | wav22_silence_trimmed_wav/p361/p361_419_mic1.wav|{EH V R IY W AH N} {IH N} {B R IH T AH N} {IH Z} {P R AW D} {AH V} {DH IH S} {T IY M}.|p361 305 | wav22_silence_trimmed_wav/p361/p361_420_mic1.wav|{W IY} {S ER T AH N L IY} {D UW} {N AA T} {IH N T EH N D} {T UW} {D IH S B AE N D}.|p361 306 | wav22_silence_trimmed_wav/p361/p361_421_mic1.wav|{M AY} {L AY F} {IH Z} {G OW IH NG} {T UW} {G OW} {AA N}.|p361 307 | wav22_silence_trimmed_wav/p362/p362_397_mic1.wav|{L OW K EY T} {IH N} {S K AA T L AH N D} {AO L S OW} {R AH F Y UW Z D} {T UW} {K AA M EH N T}.|p362 308 | wav22_silence_trimmed_wav/p362/p362_399_mic1.wav|{DH AE T} {IH Z} {W AY} {HH IY} {IH Z} {S OW} {S P EH SH AH L}.|p362 309 | wav22_silence_trimmed_wav/p362/p362_406_mic1.wav|{HH IY} {AE D IH D} {DH AE T} {HH IY} {F EH L T} {AH N D ER M AY N D}.|p362 310 | wav22_silence_trimmed_wav/p363/p363_418_mic1.wav|{OW V ER AO L}, {IH T} {HH AE Z} {S ER T AH N L IY} {W ER K T} {AW T}.|p363 311 | wav22_silence_trimmed_wav/p363/p363_419_mic1.wav|{DH IH S} {IH Z} {AE K CH UW AH L IY} {T R UW}.|p363 312 | wav22_silence_trimmed_wav/p363/p363_420_mic1.wav|{B OW TH} {W ER} {R IH D UW S T} {T UW} {R AH B AH L}.|p363 313 | wav22_silence_trimmed_wav/p364/p364_303_mic1.wav|{IH T} {W AA Z} {DH EH N} {DH AE T} {R EY N JH ER Z} {S K AO R D}.|p364 314 | wav22_silence_trimmed_wav/p364/p364_304_mic1.wav|{DH IH S} {W AA Z} {R IH P IY T IH D} {DH AH} {S EH K AH N D} {T AY M}.|p364 315 | wav22_silence_trimmed_wav/p364/p364_305_mic1.wav|{AY} {CH OW 
Z} {DH AH} {F AO R M ER}.|p364 316 | wav22_silence_trimmed_wav/p374/p374_419_mic1.wav|{DH EY} {HH AE V} {N AA T} {B IH N} {N EY M D}.|p374 317 | wav22_silence_trimmed_wav/p374/p374_420_mic1.wav|{IH T} {F IY L Z} {L AY K} {T UW} {M AH CH}.|p374 318 | wav22_silence_trimmed_wav/p374/p374_421_mic1.wav|{DH IH S} {IH Z} {DH AH} {L AE S T} {TH IH NG} {W IY} {EH V ER} {IH K S P EH K T AH D}.|p374 319 | wav22_silence_trimmed_wav/p376/p376_419_mic1.wav|{AY V} {N EH V ER} {D AH N} {DH AE T} {B IH F AO R} {IH N} {M AY} {L AY F}.|p376 320 | wav22_silence_trimmed_wav/p376/p376_420_mic1.wav|{AY} {K AE N T} {S T AA P} {P IY P AH L} {TH IH NG K IH NG}.|p376 321 | wav22_silence_trimmed_wav/p376/p376_421_mic1.wav|{DH AE T} {W IH L} {B IY} {B EH S T} {F AO R} {HH IH M}.|p376 322 | wav22_silence_trimmed_wav/s5/s5_395_mic1.wav|{DH AE T} {HH AE D} {HH AE P AH N D} {L AO NG} {B IH F AO R} {HH AE F}-{T AY M} {Y EH S T ER D EY}.|s5 323 | wav22_silence_trimmed_wav/s5/s5_396_mic1.wav|{M EH N IY} {F AA R M ER Z} {K AE N AA T} {IY V IH N} {AH G R IY} {W IH DH IH N} {DH EH R} {OW N} {F AE M AH L IY Z}.|s5 324 | wav22_silence_trimmed_wav/s5/s5_397_mic1.wav|{N UW} {Y AO R K} {P R AH V AY D AH D} {L IH T AH L} {HH EH L P} {F AO R} {DH AH} {L AH N D AH N} {M AA R K AH T}.|s5 325 | -------------------------------------------------------------------------------- /inference.py: -------------------------------------------------------------------------------- 1 | import warnings 2 | warnings.filterwarnings(action='ignore') 3 | 4 | import os 5 | import utils 6 | import argparse 7 | import torch 8 | from models import SynthesizerTrn 9 | from text.symbols import symbol_len 10 | from data_utils import infer_text_process 11 | 12 | import soundcard as sc 13 | import soundfile as sf 14 | 15 | import time 16 | 17 | def inference(args): 18 | 19 | hps = utils.get_hparams(args) 20 | device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') 21 | torch.manual_seed(hps.train.seed) 22 | 23 | net_g = SynthesizerTrn(symbol_len(hps.data.languages), 24 | hps.data.filter_length // 2 + 1, 25 | hps.train.segment_size // hps.data.hop_length, 26 | n_speakers=len(hps.data.speakers), 27 | midi_start=hps.data.midi_start, 28 | midi_end=hps.data.midi_end, 29 | octave_range=hps.data.octave_range, 30 | **hps.model, 31 | ### add ### 32 | sr=hps.data.sampling_rate, 33 | W=hps.data.ying_window, 34 | w_step=hps.data.ying_hop, 35 | tau_max=hps.data.tau_max 36 | ########### 37 | ).to(device) 38 | 39 | # load checkpoint 40 | utils.load_checkpoint(args.model_path, model_g=net_g) 41 | 42 | # play audio by system default 43 | speaker = sc.get_speaker(sc.default_speaker().name) 44 | s_name_list = hps.data.speakers 45 | 46 | # parameter settings 47 | noise_scale = torch.tensor(0.66) # adjust z_p noise 48 | noise_scale_w = torch.tensor(0.66) # adjust SDP noise 49 | length_scale = torch.tensor(1.0) # adjust sound length scale (talk speed) 50 | max_len = None # frame max length 51 | 52 | if args.is_save is True: 53 | n_save = 0 54 | save_dir = os.path.join("./infer_logs/", args.model) 55 | os.makedirs(save_dir, exist_ok=True) 56 | 57 | while True: 58 | 59 | print("\n") 60 | 61 | # get speaker id 62 | speaker_name = input("Enter speaker name. ==> ") 63 | if speaker_name=="": 64 | print("Empty input is detected... Exit...") 65 | break 66 | try: 67 | sid = int(s_name_list.index(speaker_name)) 68 | except: 69 | while True: 70 | print(f"{speaker_name} does not existed. 
The list of speakers is displayed below.") 71 | print(s_name_list) 72 | speaker_name = input("Enter speaker name. ==> ") 73 | try: 74 | sid = int(s_name_list.index(speaker_name)) 75 | break 76 | except: 77 | print("~RETRY~") 78 | pass 79 | sid = torch.unsqueeze(torch.tensor(sid, dtype=torch.int64), dim=0) 80 | 81 | # get text 82 | text = input("Enter text. ==> ") 83 | if text=="": 84 | print("Empty input is detected... Exit...") 85 | break 86 | 87 | # get shift pitch 88 | scope = input("Enter pitch shift value(int). -15〜15 ==> ") 89 | if scope=="": 90 | print("Empty input is detected... Exit...") 91 | break 92 | try: 93 | scope = int(scope) 94 | except: 95 | while True: 96 | print(f"Enter an integer value. {scope} is not integer value.") 97 | scope = input("Enter pitch shift value(int). -15〜15 ==> ") 98 | try: 99 | scope = int(scope) 100 | break 101 | except: 102 | print("~RETRY~") 103 | pass 104 | if scope > 15: 105 | scope = 15 106 | print("The upper limit of pitch shift value is 15. 15 is entered.") 107 | elif scope < -15: 108 | scope = -15 109 | print("The lower limit of pitch shift value is -15. -15 is entered.") 110 | scope_shift = torch.tensor(scope, dtype=torch.int64) # pitch adjust 111 | 112 | # measure the execution time 113 | torch.cuda.synchronize() 114 | start = time.time() 115 | 116 | # required_grad is False 117 | with torch.inference_mode(): 118 | x, t, x_lengths = infer_text_process(text=text, hps=hps) 119 | 120 | # generate audio 121 | y_hat, _, _, _ = net_g.infer(x=x.to(device), 122 | t=t.to(device), 123 | x_lengths=x_lengths.to(device), 124 | sid=sid.to(device), 125 | noise_scale=noise_scale.to(device), 126 | noise_scale_w=noise_scale_w.to(device), 127 | length_scale=length_scale.to(device), 128 | max_len=max_len, 129 | scope_shift=scope_shift.to(device)) 130 | 131 | # measure the execution time 132 | torch.cuda.synchronize() 133 | elapsed_time = time.time() - start 134 | print(f"Gen Time : {elapsed_time}") 135 | 136 | # play audio 137 | speaker.play(y_hat.view(-1).to('cpu').detach().numpy().copy(), hps.data.sampling_rate) 138 | 139 | # save audio 140 | if args.is_save is True: 141 | n_save += 1 142 | save_path = os.path.join(save_dir, str(n_save).zfill(3)+f"_{speaker_name}_scope={scope}_{text}.wav") 143 | data = y_hat.view(-1).to('cpu').detach().numpy().copy() 144 | sf.write( 145 | file=save_path, 146 | data=data, 147 | samplerate=hps.data.sampling_rate, 148 | format="WAV") 149 | print(f"Audio is saved at : {save_path}") 150 | 151 | 152 | return 0 153 | 154 | if __name__ == "__main__": 155 | 156 | parser = argparse.ArgumentParser() 157 | parser.add_argument('--config', 158 | type=str, 159 | required=True, 160 | help='Path to configuration file') 161 | parser.add_argument('--model', 162 | type=str, 163 | required=True, 164 | help='Model name') 165 | parser.add_argument('--model_path', 166 | type=str, 167 | required=True, 168 | help='Path to checkpoint') 169 | parser.add_argument('--is_save', 170 | type=str, 171 | default=True, 172 | help='Whether to save output or not') 173 | args = parser.parse_args() 174 | 175 | inference(args) -------------------------------------------------------------------------------- /losses.py: -------------------------------------------------------------------------------- 1 | # from https://github.com/jaywalnut310/vits 2 | import torch 3 | from torch.autograd import Function 4 | 5 | 6 | def feature_loss(fmap_r, fmap_g): 7 | loss = 0 8 | for dr, dg in zip(fmap_r, fmap_g): 9 | for rl, gl in zip(dr, dg): 10 | rl = rl.float().detach() 11 | gl = 
gl.float() 12 | loss += torch.mean(torch.abs(rl - gl)) 13 | 14 | return loss * 2 15 | 16 | 17 | def discriminator_loss(disc_real_outputs, disc_generated_outputs): 18 | loss = 0 19 | r_losses = [] 20 | g_losses = [] 21 | for dr, dg in zip(disc_real_outputs, disc_generated_outputs): 22 | dr = dr.float() 23 | dg = dg.float() 24 | r_loss = torch.mean((1-dr)**2) 25 | g_loss = torch.mean(dg**2) 26 | loss += (r_loss + g_loss) 27 | r_losses.append(r_loss.item()) 28 | g_losses.append(g_loss.item()) 29 | 30 | return loss, r_losses, g_losses 31 | 32 | 33 | def generator_loss(disc_outputs): 34 | loss = 0 35 | gen_losses = [] 36 | for dg in disc_outputs: 37 | dg = dg.float() 38 | l = torch.mean((1-dg)**2) 39 | gen_losses.append(l) 40 | loss += l 41 | 42 | return loss, gen_losses 43 | 44 | 45 | def kl_loss(z_p, logs, m_p, logs_p, z_mask): 46 | """ 47 | z_p, logs: [b, h, t_t] 48 | m_p, logs_p: [b, h, t_t] 49 | """ 50 | z_p = z_p.float() 51 | logs = logs.float() 52 | m_p = m_p.float() 53 | logs_p = logs_p.float() 54 | z_mask = z_mask.float() 55 | 56 | kl = logs_p - logs - 0.5 57 | kl += 0.5 * ((z_p - m_p)**2) * torch.exp(-2. * logs_p) 58 | kl = torch.sum(kl * z_mask) 59 | l = kl / torch.sum(z_mask) 60 | return l 61 | 62 | 63 | class ReverseLayerF(Function): 64 | 65 | @staticmethod 66 | def forward(ctx, x, alpha): 67 | ctx.alpha = alpha 68 | 69 | return x.view_as(x) 70 | 71 | @staticmethod 72 | def backward(ctx, grad_output): 73 | output = grad_output.neg() * ctx.alpha 74 | 75 | return output, None 76 | -------------------------------------------------------------------------------- /mel_processing.py: -------------------------------------------------------------------------------- 1 | # from https://github.com/jaywalnut310/vits 2 | import torch 3 | import torch.utils.data 4 | from librosa.filters import mel as librosa_mel_fn 5 | from torch.cuda.amp import autocast 6 | 7 | def dynamic_range_compression_torch(x, C=1, clip_val=1e-5): 8 | """ 9 | PARAMS 10 | ------ 11 | C: compression factor 12 | """ 13 | return torch.log(torch.clamp(x, min=clip_val) * C) 14 | 15 | 16 | def dynamic_range_decompression_torch(x, C=1): 17 | """ 18 | PARAMS 19 | ------ 20 | C: compression factor used to compress 21 | """ 22 | return torch.exp(x) / C 23 | 24 | 25 | def spectral_normalize_torch(magnitudes): 26 | output = dynamic_range_compression_torch(magnitudes) 27 | return output 28 | 29 | 30 | def spectral_de_normalize_torch(magnitudes): 31 | output = dynamic_range_decompression_torch(magnitudes) 32 | return output 33 | 34 | 35 | mel_basis = {} 36 | hann_window = {} 37 | 38 | 39 | def spectrogram_torch(y, n_fft, sampling_rate, hop_size, win_size, center=False): 40 | if torch.min(y) < -1.: 41 | print('min value is ', torch.min(y)) 42 | if torch.max(y) > 1.: 43 | print('max value is ', torch.max(y)) 44 | 45 | global hann_window 46 | dtype_device = str(y.dtype) + '_' + str(y.device) 47 | wnsize_dtype_device = str(win_size) + '_' + dtype_device 48 | if wnsize_dtype_device not in hann_window: 49 | hann_window[wnsize_dtype_device] = torch.hann_window(win_size).to(dtype=y.dtype, device=y.device) 50 | 51 | y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect') 52 | y = y.squeeze(1) 53 | with autocast(enabled=False): 54 | y=y.float() 55 | spec = torch.stft( 56 | y, 57 | n_fft, 58 | hop_length=hop_size, 59 | win_length=win_size, 60 | window=hann_window[wnsize_dtype_device], 61 | center=center, 62 | pad_mode='reflect', 63 | normalized=False, 64 | onesided=True, 65 | 
return_complex=False 66 | ) 67 | 68 | spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6) 69 | return spec 70 | 71 | 72 | def spec_to_mel_torch(spec, n_fft, num_mels, sampling_rate, fmin, fmax): 73 | global mel_basis 74 | dtype_device = str(spec.dtype) + '_' + str(spec.device) 75 | fmax_dtype_device = str(fmax) + '_' + dtype_device 76 | if fmax_dtype_device not in mel_basis: 77 | mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax) 78 | mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to(dtype=spec.dtype, device=spec.device) 79 | spec = torch.matmul(mel_basis[fmax_dtype_device], spec) 80 | spec = spectral_normalize_torch(spec) 81 | return spec 82 | 83 | 84 | def mel_spectrogram_torch(y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False): 85 | if torch.min(y) < -1.: 86 | print('min value is ', torch.min(y)) 87 | if torch.max(y) > 1.: 88 | print('max value is ', torch.max(y)) 89 | 90 | global mel_basis, hann_window 91 | dtype_device = str(y.dtype) + '_' + str(y.device) 92 | fmax_dtype_device = str(fmax) + '_' + dtype_device 93 | wnsize_dtype_device = str(win_size) + '_' + dtype_device 94 | if fmax_dtype_device not in mel_basis: 95 | mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax) 96 | mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to(dtype=y.dtype, device=y.device) 97 | if wnsize_dtype_device not in hann_window: 98 | hann_window[wnsize_dtype_device] = torch.hann_window(win_size).to(dtype=y.dtype, device=y.device) 99 | 100 | y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect') 101 | y = y.squeeze(1) 102 | with autocast(enabled=False): 103 | y=y.float() 104 | spec = torch.stft( 105 | y, 106 | n_fft, 107 | hop_length=hop_size, 108 | win_length=win_size, 109 | window=hann_window[wnsize_dtype_device], 110 | center=center, 111 | pad_mode='reflect', 112 | normalized=False, 113 | onesided=True, 114 | return_complex=False 115 | ) 116 | 117 | spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6) 118 | 119 | spec = torch.matmul(mel_basis[fmax_dtype_device], spec) 120 | spec = spectral_normalize_torch(spec) 121 | 122 | return spec 123 | -------------------------------------------------------------------------------- /metadata_cleaners.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | 4 | from text import cleaners 5 | 6 | 7 | def main(args): 8 | cleaner = getattr(cleaners, args.cleaner) 9 | if not cleaner: 10 | raise Exception('Unknown cleaner: %s' % args.cleaner) 11 | 12 | train_path = args.train_path 13 | train_metadata = list() 14 | 15 | val_path = args.validation_path 16 | val_metadata = list() 17 | 18 | with open(train_path, "r", encoding="utf-8") as f: 19 | for line in f: 20 | wav_path, text, speaker = line.strip("\n").split("|") 21 | meta = [wav_path, cleaner(text), speaker] 22 | train_metadata.append(meta) 23 | 24 | path, ext = os.path.splitext(train_path) 25 | with open(path + "_cleaned" + ext, "w", encoding="utf-8") as f: 26 | for meta in train_metadata: 27 | f.write("|".join(meta) + "\n") 28 | 29 | if val_path is not None: 30 | with open(val_path, "r", encoding="utf-8") as f: 31 | for line in f: 32 | wav_path, text, speaker = line.strip("\n").split("|") 33 | meta = [wav_path, cleaner(text), speaker] 34 | val_metadata.append(meta) 35 | 36 | path, ext = os.path.splitext(val_path) 37 | with open(path + "_cleaned" + ext, "w", encoding="utf-8") as f: 38 | for meta in val_metadata: 39 | f.write("|".join(meta) + "\n") 
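The row-level transformation that `main()` applies above is easy to exercise on its own: each metadata line is split on `|`, the text field is passed through the selected cleaner, and the result is written next to the input file with a `_cleaned` suffix before the extension. A minimal sketch, assuming the same pipe-separated `wav_path|text|speaker` format; the sample row and the choice of `english_cleaners` are illustrative only (the CLI below defaults to `korean_cleaners`):

```python
from text import cleaners

# One metadata row in the "wav_path|text|speaker" format the script expects.
line = "p225/p225_001.wav|Please call Stella.|p225"
wav_path, text, speaker = line.strip("\n").split("|")

cleaned = cleaners.english_cleaners(text)  # cleaner is normally chosen via --cleaner
print("|".join([wav_path, cleaned, speaker]))
```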
40 | 41 | 42 | if __name__ == "__main__": 43 | parser = argparse.ArgumentParser() 44 | parser.add_argument( 45 | "-t", 46 | "--train_path", 47 | type=str, 48 | required=True, 49 | help="path to train metadata", 50 | ) 51 | parser.add_argument( 52 | "-v", 53 | "--validation_path", 54 | type=str, 55 | default=None, 56 | help="path to validation metadata", 57 | ) 58 | parser.add_argument( 59 | "-c", 60 | "--cleaner", 61 | type=str, 62 | default='korean_cleaners', 63 | help="cleaner for text cleaning", 64 | ) 65 | 66 | args = parser.parse_args() 67 | main(args) 68 | -------------------------------------------------------------------------------- /modules.py: -------------------------------------------------------------------------------- 1 | # from https://github.com/jaywalnut310/vits 2 | import math 3 | import torch 4 | from torch import nn 5 | from torch.nn import Conv1d 6 | from torch.nn import functional as F 7 | from torch.nn.utils import weight_norm, remove_weight_norm 8 | 9 | import commons 10 | from commons import init_weights, get_padding 11 | from transforms import piecewise_rational_quadratic_transform 12 | 13 | 14 | LRELU_SLOPE = 0.1 15 | 16 | 17 | class LayerNorm(nn.Module): 18 | def __init__(self, channels, eps=1e-5): 19 | super().__init__() 20 | self.channels = channels 21 | self.eps = eps 22 | 23 | self.gamma = nn.Parameter(torch.ones(channels)) 24 | self.beta = nn.Parameter(torch.zeros(channels)) 25 | 26 | def forward(self, x): 27 | x = x.transpose(1, -1) 28 | x = F.layer_norm(x, (self.channels,), self.gamma, self.beta, self.eps) 29 | return x.transpose(1, -1) 30 | 31 | 32 | class ConvReluNorm(nn.Module): 33 | def __init__(self, in_channels, hidden_channels, out_channels, kernel_size, n_layers, p_dropout): 34 | super().__init__() 35 | self.in_channels = in_channels 36 | self.hidden_channels = hidden_channels 37 | self.out_channels = out_channels 38 | self.kernel_size = kernel_size 39 | self.n_layers = n_layers 40 | self.p_dropout = p_dropout 41 | assert n_layers > 1, "Number of layers should be larger than 1."
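The `LayerNorm` above normalizes a `[batch, channels, frames]` tensor over its channel dimension by transposing around `F.layer_norm`. A quick shape check as a sketch (the tensor sizes are arbitrary stand-ins):

```python
import torch
from modules import LayerNorm  # the channel-wise LayerNorm defined in this file

x = torch.randn(2, 192, 50)   # [batch, channels, frames]
ln = LayerNorm(192)
y = ln(x)                     # each frame normalized over its 192 channels
assert y.shape == x.shape
```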
42 | 43 | self.conv_layers = nn.ModuleList() 44 | self.norm_layers = nn.ModuleList() 45 | self.conv_layers.append( 46 | nn.Conv1d(in_channels, hidden_channels, kernel_size, padding=kernel_size//2) 47 | ) 48 | self.norm_layers.append(LayerNorm(hidden_channels)) 49 | self.relu_drop = nn.Sequential( 50 | nn.ReLU(), 51 | nn.Dropout(p_dropout)) 52 | for _ in range(n_layers-1): 53 | self.conv_layers.append(nn.Conv1d( 54 | hidden_channels, hidden_channels, kernel_size, padding=kernel_size//2) 55 | ) 56 | self.norm_layers.append(LayerNorm(hidden_channels)) 57 | self.proj = nn.Conv1d(hidden_channels, out_channels, 1) 58 | self.proj.weight.data.zero_() 59 | self.proj.bias.data.zero_() 60 | 61 | def forward(self, x, x_mask): 62 | x_org = x 63 | for i in range(self.n_layers): 64 | x = self.conv_layers[i](x * x_mask) 65 | x = self.norm_layers[i](x) 66 | x = self.relu_drop(x) 67 | x = x_org + self.proj(x) 68 | return x * x_mask 69 | 70 | 71 | class DDSConv(nn.Module): 72 | """Dialted and Depth-Separable Convolution""" 73 | def __init__(self, channels, kernel_size, n_layers, p_dropout=0.): 74 | super().__init__() 75 | self.channels = channels 76 | self.kernel_size = kernel_size 77 | self.n_layers = n_layers 78 | self.p_dropout = p_dropout 79 | 80 | self.drop = nn.Dropout(p_dropout) 81 | self.convs_sep = nn.ModuleList() 82 | self.convs_1x1 = nn.ModuleList() 83 | self.norms_1 = nn.ModuleList() 84 | self.norms_2 = nn.ModuleList() 85 | for i in range(n_layers): 86 | dilation = kernel_size ** i 87 | padding = (kernel_size * dilation - dilation) // 2 88 | self.convs_sep.append( 89 | nn.Conv1d( 90 | channels, 91 | channels, 92 | kernel_size, 93 | groups=channels, 94 | dilation=dilation, 95 | padding=padding 96 | ) 97 | ) 98 | self.convs_1x1.append(nn.Conv1d(channels, channels, 1)) 99 | self.norms_1.append(LayerNorm(channels)) 100 | self.norms_2.append(LayerNorm(channels)) 101 | 102 | def forward(self, x, x_mask, g=None): 103 | if g is not None: 104 | x = x + g 105 | for i in range(self.n_layers): 106 | y = self.convs_sep[i](x * x_mask) 107 | y = self.norms_1[i](y) 108 | y = F.gelu(y) 109 | y = self.convs_1x1[i](y) 110 | y = self.norms_2[i](y) 111 | y = F.gelu(y) 112 | y = self.drop(y) 113 | x = x + y 114 | return x * x_mask 115 | 116 | 117 | class WN(torch.nn.Module): 118 | def __init__(self, hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=0, p_dropout=0): 119 | super(WN, self).__init__() 120 | assert(kernel_size % 2 == 1) 121 | self.hidden_channels = hidden_channels 122 | self.kernel_size = kernel_size, 123 | self.dilation_rate = dilation_rate 124 | self.n_layers = n_layers 125 | self.gin_channels = gin_channels 126 | self.p_dropout = p_dropout 127 | 128 | self.in_layers = torch.nn.ModuleList() 129 | self.res_skip_layers = torch.nn.ModuleList() 130 | self.drop = nn.Dropout(p_dropout) 131 | 132 | if gin_channels != 0: 133 | cond_layer = torch.nn.Conv1d(gin_channels, 2*hidden_channels*n_layers, 1) 134 | self.cond_layer = torch.nn.utils.weight_norm(cond_layer, name='weight') 135 | 136 | for i in range(n_layers): 137 | dilation = dilation_rate ** i 138 | padding = int((kernel_size * dilation - dilation) / 2) 139 | in_layer = torch.nn.Conv1d( 140 | hidden_channels, 141 | 2*hidden_channels, 142 | kernel_size, 143 | dilation=dilation, 144 | padding=padding 145 | ) 146 | in_layer = torch.nn.utils.weight_norm(in_layer, name='weight') 147 | self.in_layers.append(in_layer) 148 | 149 | # last one is not necessary 150 | if i < n_layers - 1: 151 | res_skip_channels = 2 * hidden_channels 152 | else: 153 | 
res_skip_channels = hidden_channels 154 | 155 | res_skip_layer = torch.nn.Conv1d( 156 | hidden_channels, res_skip_channels, 1 157 | ) 158 | res_skip_layer = torch.nn.utils.weight_norm( 159 | res_skip_layer, name='weight' 160 | ) 161 | self.res_skip_layers.append(res_skip_layer) 162 | 163 | def forward(self, x, x_mask, g=None, **kwargs): 164 | output = torch.zeros_like(x) 165 | n_channels_tensor = torch.IntTensor([self.hidden_channels]) 166 | 167 | if g is not None: 168 | g = self.cond_layer(g) 169 | 170 | for i in range(self.n_layers): 171 | x_in = self.in_layers[i](x) 172 | if g is not None: 173 | cond_offset = i * 2 * self.hidden_channels 174 | g_l = g[:, cond_offset:cond_offset+2*self.hidden_channels, :] 175 | else: 176 | g_l = torch.zeros_like(x_in) 177 | 178 | acts = commons.fused_add_tanh_sigmoid_multiply( 179 | x_in, 180 | g_l, 181 | n_channels_tensor 182 | ) 183 | acts = self.drop(acts) 184 | 185 | res_skip_acts = self.res_skip_layers[i](acts) 186 | if i < self.n_layers - 1: 187 | res_acts = res_skip_acts[:, :self.hidden_channels, :] 188 | x = (x + res_acts) * x_mask 189 | output = output + res_skip_acts[:, self.hidden_channels:, :] 190 | else: 191 | output = output + res_skip_acts 192 | return output * x_mask 193 | 194 | def remove_weight_norm(self): 195 | if self.gin_channels != 0: 196 | torch.nn.utils.remove_weight_norm(self.cond_layer) 197 | for l in self.in_layers: 198 | torch.nn.utils.remove_weight_norm(l) 199 | for l in self.res_skip_layers: 200 | torch.nn.utils.remove_weight_norm(l) 201 | 202 | 203 | class ResBlock1(torch.nn.Module): 204 | def __init__(self, channels, kernel_size=3, dilation=(1, 3, 5)): 205 | super(ResBlock1, self).__init__() 206 | self.convs1 = nn.ModuleList([ 207 | weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0], 208 | padding=get_padding(kernel_size, dilation[0]))), 209 | weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1], 210 | padding=get_padding(kernel_size, dilation[1]))), 211 | weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[2], 212 | padding=get_padding(kernel_size, dilation[2]))) 213 | ]) 214 | self.convs1.apply(init_weights) 215 | 216 | self.convs2 = nn.ModuleList([ 217 | weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1, 218 | padding=get_padding(kernel_size, 1))), 219 | weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1, 220 | padding=get_padding(kernel_size, 1))), 221 | weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1, 222 | padding=get_padding(kernel_size, 1))) 223 | ]) 224 | self.convs2.apply(init_weights) 225 | 226 | def forward(self, x, x_mask=None): 227 | for c1, c2 in zip(self.convs1, self.convs2): 228 | xt = F.leaky_relu(x, LRELU_SLOPE) 229 | if x_mask is not None: 230 | xt = xt * x_mask 231 | xt = c1(xt) 232 | xt = F.leaky_relu(xt, LRELU_SLOPE) 233 | if x_mask is not None: 234 | xt = xt * x_mask 235 | xt = c2(xt) 236 | x = xt + x 237 | if x_mask is not None: 238 | x = x * x_mask 239 | return x 240 | 241 | def remove_weight_norm(self): 242 | for l in self.convs1: 243 | remove_weight_norm(l) 244 | for l in self.convs2: 245 | remove_weight_norm(l) 246 | 247 | 248 | class ResBlock2(torch.nn.Module): 249 | def __init__(self, channels, kernel_size=3, dilation=(1, 3)): 250 | super(ResBlock2, self).__init__() 251 | self.convs = nn.ModuleList([ 252 | weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0], 253 | padding=get_padding(kernel_size, dilation[0]))), 254 | weight_norm(Conv1d(channels, 
channels, kernel_size, 1, dilation=dilation[1], 255 | padding=get_padding(kernel_size, dilation[1]))) 256 | ]) 257 | self.convs.apply(init_weights) 258 | 259 | def forward(self, x, x_mask=None): 260 | for c in self.convs: 261 | xt = F.leaky_relu(x, LRELU_SLOPE) 262 | if x_mask is not None: 263 | xt = xt * x_mask 264 | xt = c(xt) 265 | x = xt + x 266 | if x_mask is not None: 267 | x = x * x_mask 268 | return x 269 | 270 | def remove_weight_norm(self): 271 | for l in self.convs: 272 | remove_weight_norm(l) 273 | 274 | 275 | class Log(nn.Module): 276 | def forward(self, x, x_mask, reverse=False, **kwargs): 277 | if not reverse: 278 | y = torch.log(torch.clamp_min(x, 1e-5)) * x_mask 279 | logdet = torch.sum(-y, [1, 2]) 280 | return y, logdet 281 | else: 282 | x = torch.exp(x) * x_mask 283 | return x 284 | 285 | 286 | class Flip(nn.Module): 287 | def forward(self, x, *args, reverse=False, **kwargs): 288 | x = torch.flip(x, [1]) 289 | if not reverse: 290 | logdet = torch.zeros(x.size(0)).to(dtype=x.dtype, device=x.device) 291 | return x, logdet 292 | else: 293 | return x 294 | 295 | 296 | class ElementwiseAffine(nn.Module): 297 | def __init__(self, channels): 298 | super().__init__() 299 | self.channels = channels 300 | self.m = nn.Parameter(torch.zeros(channels, 1)) 301 | self.logs = nn.Parameter(torch.zeros(channels, 1)) 302 | 303 | def forward(self, x, x_mask, reverse=False, **kwargs): 304 | if not reverse: 305 | y = self.m + torch.exp(self.logs) * x 306 | y = y * x_mask 307 | logdet = torch.sum(self.logs * x_mask, [1, 2]) 308 | return y, logdet 309 | else: 310 | x = (x - self.m) * torch.exp(-self.logs) * x_mask 311 | return x 312 | 313 | 314 | class ResidualCouplingLayer(nn.Module): 315 | def __init__( 316 | self, 317 | channels, 318 | hidden_channels, 319 | kernel_size, 320 | dilation_rate, 321 | n_layers, 322 | p_dropout=0, 323 | gin_channels=0, 324 | mean_only=False 325 | ): 326 | assert channels % 2 == 0, "channels should be divisible by 2" 327 | super().__init__() 328 | self.channels = channels 329 | self.hidden_channels = hidden_channels 330 | self.kernel_size = kernel_size 331 | self.dilation_rate = dilation_rate 332 | self.n_layers = n_layers 333 | self.half_channels = channels // 2 334 | self.mean_only = mean_only 335 | 336 | self.pre = nn.Conv1d(self.half_channels, hidden_channels, 1) 337 | self.enc = WN( 338 | hidden_channels, 339 | kernel_size, 340 | dilation_rate, 341 | n_layers, 342 | p_dropout=p_dropout, 343 | gin_channels=gin_channels 344 | ) 345 | self.post = nn.Conv1d( 346 | hidden_channels, self.half_channels * (2 - mean_only), 1 347 | ) 348 | self.post.weight.data.zero_() 349 | self.post.bias.data.zero_() 350 | 351 | def forward(self, x, x_mask, g=None, reverse=False): 352 | x0, x1 = torch.split(x, [self.half_channels] * 2, 1) 353 | h = self.pre(x0) * x_mask 354 | h = self.enc(h, x_mask, g=g) 355 | stats = self.post(h) * x_mask 356 | if not self.mean_only: 357 | m, logs = torch.split(stats, [self.half_channels] * 2, 1) 358 | else: 359 | m = stats 360 | logs = torch.zeros_like(m) 361 | 362 | if not reverse: 363 | x1 = m + x1 * torch.exp(logs) * x_mask 364 | x = torch.cat([x0, x1], 1) 365 | logdet = torch.sum(logs, [1, 2]) 366 | return x, logdet 367 | else: 368 | x1 = (x1 - m) * torch.exp(-logs) * x_mask 369 | x = torch.cat([x0, x1], 1) 370 | return x 371 | 372 | 373 | class ConvFlow(nn.Module): 374 | def __init__(self, in_channels, filter_channels, kernel_size, n_layers, num_bins=10, tail_bound=5.0): 375 | super().__init__() 376 | self.in_channels = in_channels 377 | 
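`ResidualCouplingLayer` above is the affine coupling step of the flow: the forward call returns the transformed tensor and its log-determinant, and calling it with `reverse=True` inverts the transform exactly. A small self-check sketch (channel and frame counts are illustrative):

```python
import torch
from modules import ResidualCouplingLayer

flow = ResidualCouplingLayer(channels=192, hidden_channels=192, kernel_size=5,
                             dilation_rate=1, n_layers=4, mean_only=True)
x = torch.randn(2, 192, 40)             # [batch, channels, frames]
x_mask = torch.ones(2, 1, 40)

y, logdet = flow(x, x_mask)             # forward direction of the flow step
x_rec = flow(y, x_mask, reverse=True)   # inverse direction recovers the input
assert torch.allclose(x, x_rec, atol=1e-4)
# self.post is zero-initialized, so an untrained layer starts out as a (near-)identity map.
```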
self.filter_channels = filter_channels 378 | self.kernel_size = kernel_size 379 | self.n_layers = n_layers 380 | self.num_bins = num_bins 381 | self.tail_bound = tail_bound 382 | self.half_channels = in_channels // 2 383 | 384 | self.pre = nn.Conv1d(self.half_channels, filter_channels, 1) 385 | self.convs = DDSConv( 386 | filter_channels, kernel_size, n_layers, p_dropout=0. 387 | ) 388 | self.proj = nn.Conv1d( 389 | filter_channels, self.half_channels * (num_bins * 3 - 1), 1 390 | ) 391 | self.proj.weight.data.zero_() 392 | self.proj.bias.data.zero_() 393 | 394 | def forward(self, x, x_mask, g=None, reverse=False): 395 | x0, x1 = torch.split(x, [self.half_channels]*2, 1) 396 | h = self.pre(x0) 397 | h = self.convs(h, x_mask, g=g) 398 | h = self.proj(h) * x_mask 399 | 400 | b, c, t = x0.shape 401 | # [b, cx?, t] -> [b, c, t, ?] 402 | h = h.reshape(b, c, -1, t).permute(0, 1, 3, 2) 403 | 404 | unnormalized_widths = h[..., :self.num_bins] / \ 405 | math.sqrt(self.filter_channels) 406 | unnormalized_heights = h[..., self.num_bins:2 * self.num_bins] / \ 407 | math.sqrt(self.filter_channels) 408 | unnormalized_derivatives = h[..., 2 * self.num_bins:] 409 | 410 | x1, logabsdet = piecewise_rational_quadratic_transform( 411 | x1, 412 | unnormalized_widths, 413 | unnormalized_heights, 414 | unnormalized_derivatives, 415 | inverse=reverse, 416 | tails='linear', 417 | tail_bound=self.tail_bound 418 | ) 419 | 420 | x = torch.cat([x0, x1], 1) * x_mask 421 | logdet = torch.sum(logabsdet * x_mask, [1, 2]) 422 | if not reverse: 423 | return x, logdet 424 | else: 425 | return x 426 | -------------------------------------------------------------------------------- /monotonic_align/__init__.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | from .monotonic_align.core import maximum_path_c 4 | 5 | 6 | def maximum_path(neg_cent, mask): 7 | """ Cython optimized version. 8 | neg_cent: [b, t_t, t_s] 9 | mask: [b, t_t, t_s] 10 | """ 11 | device = neg_cent.device 12 | dtype = neg_cent.dtype 13 | neg_cent = neg_cent.data.cpu().numpy().astype(np.float32) 14 | path = np.zeros(neg_cent.shape, dtype=np.int32) 15 | 16 | t_t_max = mask.sum(1)[:, 0].data.cpu().numpy().astype(np.int32) 17 | t_s_max = mask.sum(2)[:, 0].data.cpu().numpy().astype(np.int32) 18 | maximum_path_c(path, neg_cent, t_t_max, t_s_max) 19 | return torch.from_numpy(path).to(device=device, dtype=dtype) 20 | -------------------------------------------------------------------------------- /monotonic_align/core.pyx: -------------------------------------------------------------------------------- 1 | cimport cython 2 | from cython.parallel import prange 3 | 4 | 5 | @cython.boundscheck(False) 6 | @cython.wraparound(False) 7 | cdef void maximum_path_each(int[:,::1] path, float[:,::1] value, int t_y, int t_x, float max_neg_val=-1e9) nogil: 8 | cdef int x 9 | cdef int y 10 | cdef float v_prev 11 | cdef float v_cur 12 | cdef float tmp 13 | cdef int index = t_x - 1 14 | 15 | for y in range(t_y): 16 | for x in range(max(0, t_x + y - t_y), min(t_x, y + 1)): 17 | if x == y: 18 | v_cur = max_neg_val 19 | else: 20 | v_cur = value[y-1, x] 21 | if x == 0: 22 | if y == 0: 23 | v_prev = 0. 
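The `maximum_path` wrapper above moves the score matrix to NumPy, runs the compiled `maximum_path_c`, and returns a hard 0/1 monotonic alignment of the same `[b, t_t, t_s]` shape. A toy invocation as a sketch (requires the Cython extension to be built first; the sizes are arbitrary, chosen so the second axis is at least as long as the third and a monotonic path exists):

```python
import torch
from monotonic_align import maximum_path

neg_cent = torch.randn(1, 6, 4)       # [b, t_t, t_s] alignment scores
mask = torch.ones(1, 6, 4)            # full-length mask in this toy case
path = maximum_path(neg_cent, mask)   # hard monotonic alignment with entries in {0, 1}
assert path.shape == neg_cent.shape
```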
24 | else: 25 | v_prev = max_neg_val 26 | else: 27 | v_prev = value[y-1, x-1] 28 | value[y, x] += max(v_prev, v_cur) 29 | 30 | for y in range(t_y - 1, -1, -1): 31 | path[y, index] = 1 32 | if index != 0 and (index == y or value[y-1, index] < value[y-1, index-1]): 33 | index = index - 1 34 | 35 | 36 | @cython.boundscheck(False) 37 | @cython.wraparound(False) 38 | cpdef void maximum_path_c(int[:,:,::1] paths, float[:,:,::1] values, int[::1] t_ys, int[::1] t_xs) nogil: 39 | cdef int b = paths.shape[0] 40 | cdef int i 41 | for i in prange(b, nogil=True): 42 | maximum_path_each(paths[i], values[i], t_ys[i], t_xs[i]) 43 | -------------------------------------------------------------------------------- /monotonic_align/setup.py: -------------------------------------------------------------------------------- 1 | from distutils.core import setup 2 | from Cython.Build import cythonize 3 | import numpy 4 | 5 | setup( 6 | name='monotonic_align', 7 | ext_modules=cythonize("core.pyx"), 8 | include_dirs=[numpy.get_include()] 9 | ) 10 | -------------------------------------------------------------------------------- /pqmf.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Copyright 2020 Tomoki Hayashi 4 | # MIT License (https://opensource.org/licenses/MIT) 5 | 6 | """Pseudo QMF modules.""" 7 | ''' 8 | Copied from https://github.com/kan-bayashi/ParallelWaveGAN/blob/master/parallel_wavegan/layers/pqmf.py 9 | ''' 10 | 11 | import numpy as np 12 | import torch 13 | import torch.nn.functional as F 14 | 15 | from scipy.signal import kaiser 16 | 17 | 18 | def design_prototype_filter(taps=62, cutoff_ratio=0.142, beta=9.0): 19 | """Design prototype filter for PQMF. 20 | This method is based on `A Kaiser window approach for the design of prototype 21 | filters of cosine modulated filterbanks`_. 22 | Args: 23 | taps (int): The number of filter taps. 24 | cutoff_ratio (float): Cut-off frequency ratio. 25 | beta (float): Beta coefficient for kaiser window. 26 | Returns: 27 | ndarray: Impluse response of prototype filter (taps + 1,). 28 | .. _`A Kaiser window approach for the design of prototype filters of cosine modulated filterbanks`: 29 | https://ieeexplore.ieee.org/abstract/document/681427 30 | """ 31 | # check the arguments are valid 32 | assert taps % 2 == 0, "The number of taps mush be even number." 33 | assert 0.0 < cutoff_ratio < 1.0, "Cutoff ratio must be > 0.0 and < 1.0." 34 | 35 | # make initial filter 36 | omega_c = np.pi * cutoff_ratio 37 | with np.errstate(invalid="ignore"): 38 | h_i = np.sin(omega_c * (np.arange(taps + 1) - 0.5 * taps)) / ( 39 | np.pi * (np.arange(taps + 1) - 0.5 * taps) 40 | ) 41 | h_i[taps // 2] = np.cos(0) * cutoff_ratio # fix nan due to indeterminate form 42 | 43 | # apply kaiser window 44 | w = kaiser(taps + 1, beta) 45 | h = h_i * w 46 | 47 | return h 48 | 49 | 50 | class PQMF(torch.nn.Module): 51 | """PQMF module. 52 | This module is based on `Near-perfect-reconstruction pseudo-QMF banks`_. 53 | .. _`Near-perfect-reconstruction pseudo-QMF banks`: 54 | https://ieeexplore.ieee.org/document/258122 55 | """ 56 | 57 | def __init__(self, subbands=4, taps=62, cutoff_ratio=0.142, beta=9.0): 58 | """Initilize PQMF module. 59 | The cutoff_ratio and beta parameters are optimized for #subbands = 4. 60 | See dicussion in https://github.com/kan-bayashi/ParallelWaveGAN/issues/195. 61 | Args: 62 | subbands (int): The number of subbands. 63 | taps (int): The number of filter taps. 
64 | cutoff_ratio (float): Cut-off frequency ratio. 65 | beta (float): Beta coefficient for kaiser window. 66 | """ 67 | super(PQMF, self).__init__() 68 | 69 | # build analysis & synthesis filter coefficients 70 | h_proto = design_prototype_filter(taps, cutoff_ratio, beta) 71 | h_analysis = np.zeros((subbands, len(h_proto))) 72 | h_synthesis = np.zeros((subbands, len(h_proto))) 73 | for k in range(subbands): 74 | h_analysis[k] = ( 75 | 2 76 | * h_proto 77 | * np.cos( 78 | (2 * k + 1) 79 | * (np.pi / (2 * subbands)) 80 | * (np.arange(taps + 1) - (taps / 2)) 81 | + (-1) ** k * np.pi / 4 82 | ) 83 | ) 84 | h_synthesis[k] = ( 85 | 2 86 | * h_proto 87 | * np.cos( 88 | (2 * k + 1) 89 | * (np.pi / (2 * subbands)) 90 | * (np.arange(taps + 1) - (taps / 2)) 91 | - (-1) ** k * np.pi / 4 92 | ) 93 | ) 94 | 95 | # convert to tensor 96 | analysis_filter = torch.Tensor(h_analysis).float().unsqueeze(1) 97 | synthesis_filter = torch.Tensor(h_synthesis).float().unsqueeze(0) 98 | 99 | # register coefficients as beffer 100 | self.register_buffer("analysis_filter", analysis_filter) 101 | self.register_buffer("synthesis_filter", synthesis_filter) 102 | 103 | # filter for downsampling & upsampling 104 | updown_filter = torch.zeros((subbands, subbands, subbands)).float() 105 | for k in range(subbands): 106 | updown_filter[k, k, 0] = 1.0 107 | self.register_buffer("updown_filter", updown_filter) 108 | self.subbands = subbands 109 | 110 | # keep padding info 111 | self.pad_fn = torch.nn.ConstantPad1d(taps // 2, 0.0) 112 | 113 | def analysis(self, x): 114 | """Analysis with PQMF. 115 | Args: 116 | x (Tensor): Input tensor (B, 1, T). 117 | Returns: 118 | Tensor: Output tensor (B, subbands, T // subbands). 119 | """ 120 | x = F.conv1d(self.pad_fn(x), self.analysis_filter) 121 | return F.conv1d(x, self.updown_filter, stride=self.subbands) 122 | 123 | def synthesis(self, x): 124 | """Synthesis with PQMF. 125 | Args: 126 | x (Tensor): Input tensor (B, subbands, T // subbands). 127 | Returns: 128 | Tensor: Output tensor (B, 1, T). 129 | """ 130 | # NOTE(kan-bayashi): Power will be dreased so here multipy by # subbands. 131 | # Not sure this is the correct way, it is better to check again. 132 | # TODO(kan-bayashi): Understand the reconstruction procedure 133 | x = F.conv_transpose1d( 134 | x, self.updown_filter * self.subbands, stride=self.subbands 135 | ) 136 | return F.conv1d(self.pad_fn(x), self.synthesis_filter) 137 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | Cython 2 | matplotlib 3 | tensorboard 4 | kiwipiepy 5 | librosa==0.8.0 6 | numpy 7 | scipy 8 | Unidecode 9 | omegaconf 10 | alias_free_torch 11 | phaseaug 12 | pyopenjtalk==0.2.0 13 | inflect 14 | phaseaug 15 | soundcard 16 | soundfile 17 | -------------------------------------------------------------------------------- /text/__init__.py: -------------------------------------------------------------------------------- 1 | """ from https://github.com/keithito/tacotron """ 2 | import re 3 | from unicodedata import normalize 4 | 5 | from text.cleaners import collapse_whitespace 6 | from text.symbols import lang_to_dict, lang_to_dict_inverse 7 | 8 | 9 | def text_to_sequence(raw_text, lang): 10 | '''Converts a string of text to a sequence of IDs corresponding to the symbols in the text. 
11 | Args: 12 | text: string to convert to a sequence 13 | lang: language of the input text 14 | Returns: 15 | List of integers corresponding to the symbols in the text 16 | ''' 17 | 18 | _symbol_to_id = lang_to_dict(lang) 19 | text = collapse_whitespace(raw_text) 20 | 21 | if lang == 'ko_KR': 22 | text = normalize('NFKD', text) 23 | sequence = [_symbol_to_id[symbol] for symbol in text] 24 | tone = [0 for i in sequence] 25 | 26 | elif lang == 'en_US': 27 | _curly_re = re.compile(r'(.*?)\{(.+?)\}(.*)') 28 | sequence = [] 29 | 30 | while len(text): 31 | m = _curly_re.match(text) 32 | 33 | if m is not None: 34 | ar = m.group(1) 35 | sequence += [_symbol_to_id[symbol] for symbol in ar] 36 | ar = m.group(2) 37 | sequence += [_symbol_to_id[symbol] for symbol in ar.split()] 38 | text = m.group(3) 39 | else: 40 | sequence += [_symbol_to_id[symbol] for symbol in text] 41 | break 42 | 43 | tone = [0 for i in sequence] 44 | 45 | ### Add ### 46 | elif lang == 'pyopenjtalk_prosody': 47 | sequence = [] 48 | phonomes = pyopenjtalk_g2p_prosody(text) # 文章⇒音素 49 | sequence += [_symbol_to_id[symbol] for symbol in phonomes] # 音素⇒index番号 50 | tone = [0 for i in sequence] # 音素分だけtoneも追加(なにこれ?) 51 | ########### 52 | 53 | else: 54 | raise RuntimeError('Wrong type of lang') 55 | 56 | assert len(sequence) == len(tone) 57 | return sequence, tone 58 | 59 | 60 | def sequence_to_text(sequence, lang): 61 | '''Converts a sequence of IDs back to a string''' 62 | _id_to_symbol = lang_to_dict_inverse(lang) 63 | result = '' 64 | for symbol_id in sequence: 65 | s = _id_to_symbol[symbol_id] 66 | result += s 67 | return result 68 | 69 | 70 | def _clean_text(text, cleaner_names): 71 | for name in cleaner_names: 72 | cleaner = getattr(cleaners, name) 73 | if not cleaner: 74 | raise Exception('Unknown cleaner: %s' % name) 75 | text = cleaner(text) 76 | return text 77 | 78 | ### Add from espnet ### 79 | # ESPNet:https://github.com/espnet/espnet 80 | ####################### 81 | def pyopenjtalk_g2p_prosody(text: str, drop_unvoiced_vowels: bool = True) : 82 | """Extract phoneme + prosoody symbol sequence from input full-context labels. 83 | 84 | The algorithm is based on `Prosodic features control by symbols as input of 85 | sequence-to-sequence acoustic modeling for neural TTS`_ with some r9y9's tweaks. 86 | 87 | Args: 88 | text (str): Input text. 89 | drop_unvoiced_vowels (bool): whether to drop unvoiced vowels. 90 | 91 | Returns: 92 | List[str]: List of phoneme + prosody symbols. 93 | 94 | Examples: 95 | >>> from espnet2.text.phoneme_tokenizer import pyopenjtalk_g2p_prosody 96 | >>> pyopenjtalk_g2p_prosody("こんにちは。") 97 | ['^', 'k', 'o', '[', 'N', 'n', 'i', 'ch', 'i', 'w', 'a', '$'] 98 | 99 | .. 
_`Prosodic features control by symbols as input of sequence-to-sequence acoustic 100 | modeling for neural TTS`: https://doi.org/10.1587/transinf.2020EDP7104 101 | 102 | """ 103 | labels = _extract_fullcontext_label(text) 104 | N = len(labels) 105 | 106 | phones = [] 107 | for n in range(N): 108 | lab_curr = labels[n] 109 | 110 | # current phoneme 111 | p3 = re.search(r"\-(.*?)\+", lab_curr).group(1) 112 | 113 | # deal unvoiced vowels as normal vowels 114 | if drop_unvoiced_vowels and p3 in "AEIOU": 115 | p3 = p3.lower() 116 | 117 | # deal with sil at the beginning and the end of text 118 | if p3 == "sil": 119 | assert n == 0 or n == N - 1 120 | if n == 0: 121 | phones.append("^") 122 | elif n == N - 1: 123 | # check question form or not 124 | e3 = _numeric_feature_by_regex(r"!(\d+)_", lab_curr) 125 | if e3 == 0: 126 | phones.append("$") 127 | elif e3 == 1: 128 | phones.append("?") 129 | continue 130 | elif p3 == "pau": 131 | phones.append("_") 132 | continue 133 | else: 134 | phones.append(p3) 135 | 136 | # accent type and position info (forward or backward) 137 | a1 = _numeric_feature_by_regex(r"/A:([0-9\-]+)\+", lab_curr) 138 | a2 = _numeric_feature_by_regex(r"\+(\d+)\+", lab_curr) 139 | a3 = _numeric_feature_by_regex(r"\+(\d+)/", lab_curr) 140 | 141 | # number of mora in accent phrase 142 | f1 = _numeric_feature_by_regex(r"/F:(\d+)_", lab_curr) 143 | 144 | a2_next = _numeric_feature_by_regex(r"\+(\d+)\+", labels[n + 1]) 145 | # accent phrase border 146 | if a3 == 1 and a2_next == 1 and p3 in "aeiouAEIOUNcl": 147 | phones.append("#") 148 | # pitch falling 149 | elif a1 == 0 and a2_next == a2 + 1 and a2 != f1: 150 | phones.append("]") 151 | # pitch rising 152 | elif a2 == 1 and a2_next == 2: 153 | phones.append("[") 154 | 155 | return phones 156 | 157 | from packaging.version import parse as V 158 | def _extract_fullcontext_label(text): 159 | import pyopenjtalk 160 | 161 | if V(pyopenjtalk.__version__) >= V("0.3.0"): 162 | return pyopenjtalk.make_label(pyopenjtalk.run_frontend(text)) 163 | else: 164 | return pyopenjtalk.run_frontend(text)[1] 165 | 166 | 167 | def _numeric_feature_by_regex(regex, s): 168 | match = re.search(regex, s) 169 | if match is None: 170 | return -50 171 | return int(match.group(1)) 172 | 173 | ####################### -------------------------------------------------------------------------------- /text/cleaners.py: -------------------------------------------------------------------------------- 1 | """ from https://github.com/keithito/tacotron """ 2 | 3 | ''' 4 | Cleaners are transformations that run over the input text at both training and eval time. 5 | Cleaners can be selected by passing a comma-delimited list of cleaner names as the "cleaners" 6 | hyperparameter. Some cleaners are English-specific. You'll typically want to use: 7 | 1. "english_cleaners" for English text 8 | 2. "transliteration_cleaners" for non-English text that can be transliterated to ASCII using 9 | the Unidecode library (https://pypi.python.org/pypi/Unidecode) 10 | 3. "basic_cleaners" if you do not want to transliterate (in this case, you should also update 11 | the symbols in symbols.py to match your data). 12 | ''' 13 | 14 | import re 15 | from unidecode import unidecode 16 | from unicodedata import normalize 17 | 18 | from .numbers import normalize_numbers 19 | 20 | 21 | # Regular expression matching whitespace: 22 | _whitespace_re = re.compile(r'\s+') 23 | 24 | # List of (regular expression, replacement) pairs for abbreviations: 25 | _abbreviations = [(re.compile('\\b%s\\.' 
% x[0], re.IGNORECASE), x[1]) for x in [ 26 | ('mrs', 'misess'), 27 | ('mr', 'mister'), 28 | ('dr', 'doctor'), 29 | ('st', 'saint'), 30 | ('co', 'company'), 31 | ('jr', 'junior'), 32 | ('maj', 'major'), 33 | ('gen', 'general'), 34 | ('drs', 'doctors'), 35 | ('rev', 'reverend'), 36 | ('lt', 'lieutenant'), 37 | ('hon', 'honorable'), 38 | ('sgt', 'sergeant'), 39 | ('capt', 'captain'), 40 | ('esq', 'esquire'), 41 | ('ltd', 'limited'), 42 | ('col', 'colonel'), 43 | ('ft', 'fort'), 44 | ]] 45 | 46 | _cht_norm = [(re.compile(r'[%s]' % x[0]), x[1]) for x in [ 47 | ('。.;', '.'), 48 | (',、', ', '), 49 | ('?', '?'), 50 | ('!', '!'), 51 | ('─‧', '-'), 52 | ('…', '...'), 53 | ('《》「」『』〈〉()', "'"), 54 | (':︰', ':'), 55 | (' ', ' ') 56 | ]] 57 | 58 | def expand_abbreviations(text): 59 | for regex, replacement in _abbreviations: 60 | text = re.sub(regex, replacement, text) 61 | return text 62 | 63 | def expand_numbers(text): 64 | return normalize_numbers(text) 65 | 66 | def lowercase(text): 67 | return text.lower() 68 | 69 | def collapse_whitespace(text): 70 | return re.sub(_whitespace_re, ' ', text) 71 | 72 | def convert_to_ascii(text): 73 | return unidecode(text) 74 | 75 | def english_cleaners(text): 76 | '''Pipeline for English text, including abbreviation expansion.''' 77 | text = convert_to_ascii(text) 78 | #text = lowercase(text) 79 | text = expand_numbers(text) 80 | text = expand_abbreviations(text) 81 | text = collapse_whitespace(text) 82 | return text 83 | 84 | def korean_cleaners(text): 85 | '''Pipeline for Korean text, including collapses whitespace.''' 86 | text = collapse_whitespace(text) 87 | text = normalize('NFKD', text) 88 | return text 89 | 90 | -------------------------------------------------------------------------------- /text/numbers.py: -------------------------------------------------------------------------------- 1 | """ from https://github.com/keithito/tacotron """ 2 | 3 | import inflect 4 | import re 5 | 6 | 7 | _inflect = inflect.engine() 8 | _comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])') 9 | _decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)') 10 | _pounds_re = re.compile(r'£([0-9\,]*[0-9]+)') 11 | _dollars_re = re.compile(r'\$([0-9\.\,]*[0-9]+)') 12 | _ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)') 13 | _number_re = re.compile(r'[0-9]+') 14 | 15 | 16 | def _remove_commas(m): 17 | return m.group(1).replace(',', '') 18 | 19 | 20 | def _expand_decimal_point(m): 21 | return m.group(1).replace('.', ' point ') 22 | 23 | 24 | def _expand_dollars(m): 25 | match = m.group(1) 26 | parts = match.split('.') 27 | if len(parts) > 2: 28 | return match + ' dollars' # Unexpected format 29 | dollars = int(parts[0]) if parts[0] else 0 30 | cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0 31 | if dollars and cents: 32 | dollar_unit = 'dollar' if dollars == 1 else 'dollars' 33 | cent_unit = 'cent' if cents == 1 else 'cents' 34 | return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit) 35 | elif dollars: 36 | dollar_unit = 'dollar' if dollars == 1 else 'dollars' 37 | return '%s %s' % (dollars, dollar_unit) 38 | elif cents: 39 | cent_unit = 'cent' if cents == 1 else 'cents' 40 | return '%s %s' % (cents, cent_unit) 41 | else: 42 | return 'zero dollars' 43 | 44 | 45 | def _expand_ordinal(m): 46 | return _inflect.number_to_words(m.group(0)) 47 | 48 | 49 | def _expand_number(m): 50 | num = int(m.group(0)) 51 | if num > 1000 and num < 3000: 52 | if num == 2000: 53 | return 'two thousand' 54 | elif num > 2000 and num < 2010: 55 | return 'two thousand ' + 
_inflect.number_to_words(num % 100) 56 | elif num % 100 == 0: 57 | return _inflect.number_to_words(num // 100) + ' hundred' 58 | else: 59 | return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ') 60 | else: 61 | return _inflect.number_to_words(num, andword='') 62 | 63 | 64 | def normalize_numbers(text): 65 | text = re.sub(_comma_number_re, _remove_commas, text) 66 | text = re.sub(_pounds_re, r'\1 pounds', text) 67 | text = re.sub(_dollars_re, _expand_dollars, text) 68 | text = re.sub(_decimal_number_re, _expand_decimal_point, text) 69 | text = re.sub(_ordinal_re, _expand_ordinal, text) 70 | text = re.sub(_number_re, _expand_number, text) 71 | return text -------------------------------------------------------------------------------- /text/symbols.py: -------------------------------------------------------------------------------- 1 | _pad = '_' 2 | _punc = ";:,.!?¡¿—-…«»'“”~() " 3 | 4 | _jamo_leads = "".join([chr(_) for _ in range(0x1100, 0x1113)]) 5 | _jamo_vowels = "".join([chr(_) for _ in range(0x1161, 0x1176)]) 6 | _jamo_tails = "".join([chr(_) for _ in range(0x11A8, 0x11C3)]) 7 | _kor_characters = _jamo_leads + _jamo_vowels + _jamo_tails 8 | 9 | _cmu_characters = [ 10 | 'AA', 'AE', 'AH', 11 | 'AO', 'AW', 'AY', 12 | 'B', 'CH', 'D', 'DH', 'EH', 'ER', 'EY', 13 | 'F', 'G', 'HH', 'IH', 'IY', 14 | 'JH', 'K', 'L', 'M', 'N', 'NG', 'OW', 'OY', 15 | 'P', 'R', 'S', 'SH', 'T', 'TH', 'UH', 'UW', 16 | 'V', 'W', 'Y', 'Z', 'ZH' 17 | ] 18 | 19 | ### add Japanese phonomes by pyopenjtalk_g2p_prosody ### 20 | _ja_characters = [ 21 | '_' , '#' , '$' , '[' , '?' , ']' , '^' , 22 | 'a' , 'b' , 'by', 'ch', 'cl', 'd' , 'dy', 23 | 'e' , 'f' , 'g' , 'gy', 'h' , 'hy', 'i' , 24 | 'j' , 'k' , 'ky', 'm' , 'my', 'n' , 'N' , 25 | 'ny', 'o' , 'p' , 'py', 'r' , 'ry', 's' , 26 | 'sh', 't' , 'ts', 'ty', 'u' , 'v' , 'w', 27 | 'y' , 'z' ] 28 | ######################################################## 29 | 30 | lang_to_symbols = { 31 | 'common': [_pad] + list(_punc), 32 | 'ko_KR': list(_kor_characters), 33 | 'en_US': _cmu_characters, 34 | 'pyopenjtalk_prosody': _ja_characters, 35 | } 36 | 37 | def lang_to_dict(lang): 38 | 39 | ### add ### 40 | if lang == "pyopenjtalk_prosody": 41 | symbol_lang = lang_to_symbols[lang] 42 | else: 43 | symbol_lang = lang_to_symbols['common'] + lang_to_symbols[lang] 44 | ############ 45 | 46 | dict_lang = {s: i for i, s in enumerate(symbol_lang)} 47 | return dict_lang 48 | 49 | def lang_to_dict_inverse(lang): 50 | 51 | ### add ### 52 | if lang == "pyopenjtalk_prosody": 53 | symbol_lang = lang_to_symbols[lang] 54 | else: 55 | symbol_lang = lang_to_symbols['common'] + lang_to_symbols[lang] 56 | ########### 57 | 58 | dict_lang = {i: s for i, s in enumerate(symbol_lang)} 59 | return dict_lang 60 | 61 | def symbol_len(lang): 62 | 63 | ### add ### 64 | if lang == "pyopenjtalk_prosody": 65 | symbol_lang = lang_to_symbols[lang] 66 | else: 67 | symbol_lang = lang_to_symbols['common'] + lang_to_symbols[lang] 68 | ########### 69 | 70 | return len(symbol_lang) 71 | -------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | # modified from https://github.com/jaywalnut310/vits 2 | import os 3 | import argparse 4 | import torch 5 | from torch.nn import functional as F 6 | from torch.utils.data import DataLoader 7 | from torch.utils.tensorboard import SummaryWriter 8 | import torch.multiprocessing as mp 9 | import torch.distributed as dist 10 | from 
torch.nn.parallel import DistributedDataParallel as DDP 11 | from torch.cuda.amp import autocast, GradScaler 12 | from tqdm import tqdm 13 | from phaseaug.phaseaug import PhaseAug 14 | import commons 15 | import utils 16 | from data_utils import (TextAudioSpeakerLoader, TextAudioSpeakerCollate, 17 | DistributedBucketSampler, create_spec, dataset_check) 18 | from models import ( 19 | SynthesizerTrn, 20 | # MultiPeriodDiscriminator, 21 | AvocodoDiscriminator) 22 | from losses import (generator_loss, discriminator_loss, feature_loss, kl_loss) 23 | from mel_processing import mel_spectrogram_torch, spec_to_mel_torch 24 | from text.symbols import symbol_len 25 | import math 26 | 27 | torch.backends.cudnn.benchmark = True 28 | global_step = 0 29 | 30 | 31 | def main(args): 32 | """Assume Single Node Multi GPUs Training Only""" 33 | assert torch.cuda.is_available(), "CPU training is not allowed." 34 | 35 | n_gpus = torch.cuda.device_count() 36 | os.environ['MASTER_ADDR'] = 'localhost' 37 | os.environ['MASTER_PORT'] = '12345' ### modify ### 38 | 39 | hps = utils.get_hparams(args) 40 | # create spectrogram files ### modify ### 41 | #create_spec(hps.data.training_files, hps.data) 42 | #create_spec(hps.data.validation_files, hps.data) 43 | dataset_check(hps.data.validation_files, hps.data) 44 | dataset_check(hps.data.training_files, hps.data) 45 | 46 | run(rank=0, n_gpus=1, hps=hps, args=args) ### modify ### 47 | #mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps, args)) 48 | 49 | 50 | def count_parameters(model, scale=1000000): 51 | return sum(p.numel() 52 | for p in model.parameters() if p.requires_grad) / scale 53 | 54 | 55 | def run(rank, n_gpus, hps, args): 56 | global global_step 57 | if rank == 0: 58 | logger = utils.get_logger(hps.model_dir) 59 | logger.info('MODEL NAME: {} in {}'.format(args.model, hps.model_dir)) 60 | logger.info( 61 | 'GPU: Use {} gpu(s) with batch size {} (FP16 running: {})'.format( 62 | n_gpus, hps.train.batch_size, hps.train.fp16_run)) 63 | 64 | utils.check_git_hash(hps.model_dir) 65 | writer = SummaryWriter(log_dir=hps.model_dir) 66 | 67 | dist.init_process_group(backend='nccl', 68 | init_method='env://', 69 | world_size=n_gpus, 70 | rank=rank, 71 | group_name=args.model) 72 | torch.manual_seed(hps.train.seed) 73 | torch.cuda.set_device(rank) 74 | 75 | use_persistent_workers = hps.data.persistent_workers 76 | use_pin_memory = not use_persistent_workers 77 | train_dataset = TextAudioSpeakerLoader(hps.data.training_files, hps.data, 78 | rank == 0 and args.initial_run) 79 | collate_fn = TextAudioSpeakerCollate() 80 | if rank == 0: 81 | eval_dataset = TextAudioSpeakerLoader(hps.data.validation_files, 82 | hps.data, rank == 0 83 | and args.initial_run) 84 | eval_loader = DataLoader( 85 | eval_dataset, 86 | num_workers=10, 87 | shuffle=False, 88 | batch_size=hps.train.batch_size, 89 | pin_memory=use_pin_memory, 90 | drop_last=False, 91 | collate_fn=collate_fn, 92 | persistent_workers=use_persistent_workers, 93 | ) 94 | elif args.initial_run: 95 | print(f'rank: {rank} is waiting...') 96 | dist.barrier() 97 | if rank == 0: 98 | logger.info('Training Started') 99 | train_sampler = DistributedBucketSampler( 100 | train_dataset, 101 | hps.train.batch_size, [32, 300, 400, 500, 600, 700, 800, 900, 1000], 102 | num_replicas=n_gpus, 103 | rank=rank, 104 | shuffle=True) 105 | train_loader = DataLoader(train_dataset, 106 | num_workers=10, 107 | shuffle=False, 108 | pin_memory=use_pin_memory, 109 | collate_fn=collate_fn, 110 | persistent_workers=use_persistent_workers, 111 | 
batch_sampler=train_sampler) 112 | 113 | net_g = SynthesizerTrn(symbol_len(hps.data.languages), 114 | hps.data.filter_length // 2 + 1, 115 | hps.train.segment_size // hps.data.hop_length, 116 | n_speakers=len(hps.data.speakers), 117 | midi_start=hps.data.midi_start, 118 | midi_end=hps.data.midi_end, 119 | octave_range=hps.data.octave_range, 120 | **hps.model, 121 | 122 | ### add ### 123 | sr=hps.data.sampling_rate, 124 | W=hps.data.ying_window, 125 | w_step=hps.data.ying_hop, 126 | tau_max=hps.data.tau_max 127 | ########### 128 | 129 | ).cuda(rank) 130 | net_d = AvocodoDiscriminator(hps.model.use_spectral_norm, hps.train.segment_size).cuda(rank) 131 | if rank == 0: 132 | logger.info('MODEL SIZE: G {:.2f}M and D {:.2f}M'.format( 133 | count_parameters(net_g), 134 | count_parameters(net_d), 135 | )) 136 | 137 | optim_g = torch.optim.AdamW(net_g.parameters(), 138 | hps.train.learning_rate, 139 | betas=hps.train.betas, 140 | eps=hps.train.eps) 141 | optim_d = torch.optim.AdamW(net_d.parameters(), 142 | hps.train.learning_rate, 143 | betas=hps.train.betas, 144 | eps=hps.train.eps) 145 | #net_g = DDP(net_g, device_ids=[rank], find_unused_parameters=True) 146 | #net_d = DDP(net_d, device_ids=[rank], find_unused_parameters=True) 147 | net_g = DDP(net_g, device_ids=[rank], find_unused_parameters=False) ### modify ### 148 | net_d = DDP(net_d, device_ids=[rank], find_unused_parameters=False) ### modify ### 149 | 150 | if args.transfer: 151 | _, _, _, _, _, _, _ = utils.load_checkpoint(args.transfer, rank, net_g, 152 | net_d, None, None) 153 | epoch_str = 1 154 | global_step = 0 155 | 156 | elif args.force_resume: 157 | _, _, _, epoch_save, _ = utils.load_checkpoint_diffsize( 158 | args.force_resume, rank, net_g, net_d) 159 | epoch_str = epoch_save + 1 160 | global_step = epoch_save * len(train_loader) + 1 161 | scheduler_g = torch.optim.lr_scheduler.ExponentialLR( 162 | optim_g, gamma=hps.train.lr_decay, last_epoch=-1) 163 | scheduler_d = torch.optim.lr_scheduler.ExponentialLR( 164 | optim_d, gamma=hps.train.lr_decay, last_epoch=-1) 165 | elif args.resume: 166 | _, _, _, _, _, epoch_save, _ = utils.load_checkpoint( 167 | args.resume, rank, net_g, net_d, optim_g, optim_d) 168 | epoch_str = epoch_save + 1 169 | global_step = epoch_save * len(train_loader) + 1 170 | else: 171 | epoch_str = 1 172 | global_step = 0 173 | 174 | scheduler_g = torch.optim.lr_scheduler.ExponentialLR( 175 | optim_g, gamma=hps.train.lr_decay, last_epoch=epoch_str - 2) 176 | scheduler_d = torch.optim.lr_scheduler.ExponentialLR( 177 | optim_d, gamma=hps.train.lr_decay, last_epoch=epoch_str - 2) 178 | 179 | scaler = GradScaler(enabled=hps.train.fp16_run) 180 | if rank == 0: 181 | outer_bar = tqdm(total=hps.train.epochs, 182 | desc="Training", 183 | position=0, 184 | leave=False) 185 | outer_bar.update(epoch_str) 186 | 187 | for epoch in range(epoch_str, hps.train.epochs + 1): 188 | if rank == 0: 189 | train_and_evaluate(rank, epoch, hps, [net_g, net_d], 190 | [optim_g, optim_d], [scheduler_g, scheduler_d], 191 | scaler, [train_loader, eval_loader], writer) 192 | else: 193 | train_and_evaluate(rank, epoch, hps, [net_g, net_d], 194 | [optim_g, optim_d], [scheduler_g, scheduler_d], 195 | scaler, [train_loader, None], None) 196 | scheduler_g.step() 197 | scheduler_d.step() 198 | if rank == 0: 199 | outer_bar.update(1) 200 | 201 | 202 | def train_and_evaluate(rank, epoch, hps, nets, optims, schedulers, scaler, 203 | loaders, writer): 204 | net_g, net_d = nets 205 | optim_g, optim_d = optims 206 | scheduler_g, scheduler_d = schedulers 
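`run()` above pairs AdamW with a per-epoch `ExponentialLR` decay and steps both schedulers once per epoch after `train_and_evaluate`. A stripped-down sketch of the same schedule; the hyperparameter values are illustrative stand-ins for the config's `learning_rate`, `betas`, `eps`, and `lr_decay`:

```python
import torch

model = torch.nn.Linear(4, 4)  # stand-in for net_g / net_d
optim = torch.optim.AdamW(model.parameters(), lr=2e-4, betas=(0.8, 0.99), eps=1e-9)
sched = torch.optim.lr_scheduler.ExponentialLR(optim, gamma=0.999875)

for epoch in range(1, 4):
    loss = model(torch.randn(8, 4)).pow(2).mean()  # dummy objective for one "epoch"
    loss.backward()
    optim.step()
    optim.zero_grad()
    sched.step()                                   # lr <- lr * gamma, once per epoch
    print(epoch, optim.param_groups[0]["lr"])
```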
207 | train_loader, eval_loader = loaders 208 | aug = PhaseAug().cuda(rank) 209 | train_loader.batch_sampler.set_epoch(epoch) 210 | global global_step 211 | 212 | net_g.train() 213 | net_d.train() 214 | if rank == 0: 215 | inner_bar = tqdm(total=len(train_loader), 216 | desc="Epoch {}".format(epoch), 217 | position=1, 218 | leave=False) 219 | 220 | for batch_idx, (x, x_lengths, spec, spec_lengths, ying, ying_lengths, y, 221 | y_lengths, speakers, tone) in enumerate(train_loader): 222 | x, x_lengths = x.cuda(rank, non_blocking=True), x_lengths.cuda( 223 | rank, non_blocking=True) 224 | spec, spec_lengths = spec.cuda( 225 | rank, non_blocking=True), spec_lengths.cuda(rank, 226 | non_blocking=True) 227 | ying, ying_lengths = ying.cuda( 228 | rank, non_blocking=True), ying_lengths.cuda(rank, 229 | non_blocking=True) 230 | 231 | y, y_lengths = y.cuda(rank, non_blocking=True), y_lengths.cuda( 232 | rank, non_blocking=True) 233 | speakers = speakers.cuda(rank, non_blocking=True) 234 | tone = tone.cuda(rank, non_blocking=True) 235 | 236 | with autocast(enabled=hps.train.fp16_run): 237 | y_hat, l_length, attn, ids_slice, x_mask, z_mask, y_hat_, \ 238 | (z, z_p, m_p, logs_p, m_q, logs_q), _, \ 239 | (z_spec, m_spec, logs_spec, spec_mask, z_yin, m_yin, logs_yin, yin_mask), \ 240 | (yin_gt_crop, yin_gt_shifted_crop, yin_dec_crop, yin_hat_crop, scope_shift, yin_hat_shifted) \ 241 | = net_g( 242 | x, tone, x_lengths, spec, spec_lengths, ying, ying_lengths, speakers 243 | ) 244 | mel = spec_to_mel_torch(spec, hps.data.filter_length, 245 | hps.data.n_mel_channels, 246 | hps.data.sampling_rate, hps.data.mel_fmin, 247 | hps.data.mel_fmax) 248 | y_mel = commons.slice_segments( 249 | mel, ids_slice, hps.train.segment_size // hps.data.hop_length) 250 | y_hat_mel = mel_spectrogram_torch( 251 | y_hat[-1].squeeze(1), hps.data.filter_length, 252 | hps.data.n_mel_channels, hps.data.sampling_rate, 253 | hps.data.hop_length, hps.data.win_length, hps.data.mel_fmin, 254 | hps.data.mel_fmax) 255 | yin_gt_crop = commons.slice_segments( 256 | torch.cat([yin_gt_crop, yin_gt_shifted_crop], dim=0), 257 | ids_slice, hps.train.segment_size // hps.data.hop_length) 258 | 259 | y_ = commons.slice_segments(torch.cat([y, y], dim=0), 260 | ids_slice * hps.data.hop_length, 261 | hps.train.segment_size) # sliced 262 | # Discriminator 263 | with autocast(enabled=False): 264 | aug_y_, aug_y_hat_last = aug.forward_sync( 265 | y_, y_hat_[-1].detach()) 266 | aug_y_hat_ = [_y.detach() for _y in y_hat_[:-1]] 267 | aug_y_hat_.append(aug_y_hat_last) 268 | y_d_hat_r, y_d_hat_g, _, _ = net_d(aug_y_, aug_y_hat_) 269 | with autocast(enabled=False): 270 | loss_disc, losses_disc_r, losses_disc_g = discriminator_loss( 271 | y_d_hat_r, y_d_hat_g) 272 | loss_disc_all = loss_disc 273 | 274 | optim_d.zero_grad() 275 | scaler.scale(loss_disc_all).backward() 276 | scaler.unscale_(optim_d) 277 | grad_norm_d = commons.clip_grad_value_(net_d.parameters(), None) 278 | scaler.step(optim_d) 279 | 280 | p = float(batch_idx + epoch * 281 | len(train_loader)) / hps.train.alpha / len(train_loader) 282 | alpha = 2. / (1. 
+ math.exp(-20 * p)) - 1 283 | 284 | with autocast(enabled=hps.train.fp16_run): 285 | # Generator 286 | with autocast(enabled=False): 287 | aug_y_, aug_y_hat_last = aug.forward_sync(y_, y_hat_[-1]) 288 | aug_y_hat_ = y_hat_ 289 | aug_y_hat_[-1] = aug_y_hat_last 290 | y_d_hat_r, y_d_hat_g, fmap_r, fmap_g = net_d(aug_y_, aug_y_hat_) 291 | with autocast(enabled=False): 292 | loss_dur = torch.sum(l_length.float()) 293 | loss_mel = F.l1_loss(y_mel, y_hat_mel) * hps.train.c_mel 294 | loss_kl = kl_loss(z_p, logs_q, m_p, logs_p, 295 | z_mask) * hps.train.c_kl 296 | loss_yin_dec = F.l1_loss(yin_gt_shifted_crop, 297 | yin_dec_crop) * hps.train.c_yin 298 | 299 | ### add ### for frame check 300 | # _, _, gt_length = yin_gt_crop.shape 301 | # _, _, hat_length = yin_hat_crop.shape 302 | # if hat_length < gt_length: 303 | # yin_gt_crop = yin_gt_crop[:,:,:hat_length] 304 | # print(f"yin_gt : {gt_length} | yin_hat : {hat_length}") 305 | # elif hat_length > gt_length: 306 | # yin_hat_crop = yin_hat_crop[:,:,:gt_length] 307 | # print(f"yin_gt : {gt_length} | yin_hat : {hat_length}") 308 | ########### 309 | 310 | loss_yin_shift = F.l1_loss( 311 | torch.exp(-yin_gt_crop), 312 | torch.exp(-yin_hat_crop)) * hps.train.c_yin + F.l1_loss( 313 | torch.exp(-yin_hat_shifted), 314 | torch.exp(-(torch.chunk(yin_hat_crop, 2, dim=0)[1])) 315 | ) * hps.train.c_yin 316 | loss_fm = feature_loss(fmap_r, fmap_g) 317 | loss_gen, losses_gen = generator_loss(y_d_hat_g) 318 | loss_gen_all = loss_gen + loss_fm + loss_mel + loss_dur + loss_kl + loss_yin_shift + loss_yin_dec 319 | optim_g.zero_grad() 320 | scaler.scale(loss_gen_all).backward() 321 | scaler.unscale_(optim_g) 322 | grad_norm_g = commons.clip_grad_value_(net_g.parameters(), None) 323 | scaler.step(optim_g) 324 | scaler.update() 325 | 326 | if rank == 0: 327 | inner_bar.update(1) 328 | inner_bar.set_description( 329 | "Epoch {} | g {: .04f} d {: .04f}|".format( 330 | epoch, loss_gen_all, loss_disc_all)) 331 | if global_step % hps.train.log_interval == 0: 332 | lr = optim_g.param_groups[0]['lr'] 333 | 334 | scalar_dict = { 335 | "learning_rate": lr, 336 | "loss/g/score": sum(losses_gen), 337 | "loss/g/fm": loss_fm, 338 | "loss/g/mel": loss_mel, 339 | "loss/g/dur": loss_dur, 340 | "loss/g/kl": loss_kl, 341 | "loss/g/yindec": loss_yin_dec, 342 | "loss/g/yinshift": loss_yin_shift, 343 | "loss/g/total": loss_gen_all, 344 | "loss/d/real": sum(losses_disc_r), 345 | "loss/d/gen": sum(losses_disc_g), 346 | "loss/d/total": loss_disc_all, 347 | } 348 | 349 | utils.summarize(writer=writer, 350 | global_step=global_step, 351 | scalars=scalar_dict) 352 | if global_step % hps.train.eval_interval == 0: 353 | evaluate(hps, global_step, epoch, net_g, eval_loader, writer) 354 | 355 | global_step += 1 356 | 357 | if rank == 0: 358 | if epoch % hps.train.save_interval == 0: 359 | utils.save_checkpoint( 360 | net_g, optim_g, net_d, optim_d, hps, epoch, 361 | hps.train.learning_rate, 362 | os.path.join(hps.model_dir, 363 | "{}_{}.pth".format(hps.model_name, epoch))) 364 | 365 | 366 | def evaluate(hps, current_step, epoch, generator, eval_loader, writer): 367 | generator.eval() 368 | n_sample = hps.train.n_sample 369 | with torch.no_grad(): 370 | loss_val_mel = 0 371 | loss_val_yin = 0 372 | val_bar = tqdm(total=len(eval_loader), 373 | desc="Validation (Step {})".format(current_step), 374 | position=1, 375 | leave=False) 376 | for batch_idx, (x, x_lengths, spec, spec_lengths, ying, ying_lengths, 377 | y, y_lengths, speakers, 378 | tone) in enumerate(eval_loader): 379 | x, x_lengths = x.cuda(0, 
non_blocking=True), x_lengths.cuda( 380 | 0, non_blocking=True) 381 | spec, spec_lengths = spec.cuda( 382 | 0, non_blocking=True), spec_lengths.cuda(0, non_blocking=True) 383 | ying, ying_lengths = ying.cuda( 384 | 0, non_blocking=True), ying_lengths.cuda(0, non_blocking=True) 385 | y, y_lengths = y.cuda(0, non_blocking=True), y_lengths.cuda( 386 | 0, non_blocking=True) 387 | speakers = speakers.cuda(0, non_blocking=True) 388 | tone = tone.cuda(0, non_blocking=True) 389 | 390 | with autocast(enabled=hps.train.fp16_run): 391 | y_hat, l_length, attn, ids_slice, x_mask, z_mask, y_hat_, \ 392 | (z, z_p, m_p, logs_p, m_q, logs_q),\ 393 | _,\ 394 | (z_spec, m_spec, logs_spec, spec_mask, z_yin, m_yin, logs_yin, yin_mask), \ 395 | (yin_gt_crop, yin_gt_shifted_crop, yin_dec_crop, yin_hat_crop, scope_shift, yin_hat_shifted) \ 396 | = generator.module( 397 | x, tone, x_lengths, spec, spec_lengths, ying, ying_lengths, speakers 398 | ) 399 | 400 | mel = spec_to_mel_torch(spec, hps.data.filter_length, 401 | hps.data.n_mel_channels, 402 | hps.data.sampling_rate, 403 | hps.data.mel_fmin, hps.data.mel_fmax) 404 | y_mel = commons.slice_segments( 405 | mel, ids_slice, 406 | hps.train.segment_size // hps.data.hop_length) 407 | y_hat_mel = mel_spectrogram_torch( 408 | y_hat[-1].squeeze(1), hps.data.filter_length, 409 | hps.data.n_mel_channels, hps.data.sampling_rate, 410 | hps.data.hop_length, hps.data.win_length, 411 | hps.data.mel_fmin, hps.data.mel_fmax) 412 | with autocast(enabled=False): 413 | loss_mel = F.l1_loss(y_mel, y_hat_mel) * hps.train.c_mel 414 | loss_val_mel += loss_mel.item() 415 | loss_yin = F.l1_loss(yin_gt_shifted_crop, 416 | yin_dec_crop) * hps.train.c_yin 417 | loss_val_yin += loss_yin.item() 418 | 419 | if batch_idx == 0: 420 | x = x[:n_sample] 421 | x_lengths = x_lengths[:n_sample] 422 | spec = spec[:n_sample] 423 | spec_lengths = spec_lengths[:n_sample] 424 | ying = ying[:n_sample] 425 | ying_lengths = ying_lengths[:n_sample] 426 | y = y[:n_sample] 427 | y_lengths = y_lengths[:n_sample] 428 | speakers = speakers[:n_sample] 429 | tone = tone[:1] 430 | 431 | decoder_inputs, _, mask, (z_crop, z, *_) \ 432 | = generator.module.infer_pre_decoder(x, tone, x_lengths, speakers, max_len=2000) 433 | y_hat = generator.module.infer_decode_chunk( 434 | decoder_inputs, speakers) 435 | 436 | #scope-shifted 437 | z_spec, z_yin = torch.split(z, 438 | hps.model.inter_channels - 439 | hps.model.yin_channels, 440 | dim=1) 441 | z_yin_crop = generator.module.crop_scope([z_yin], 6)[0] 442 | z_crop_shift = torch.cat([z_spec, z_yin_crop], dim=1) 443 | decoder_inputs_shift = z_crop_shift * mask 444 | y_hat_shift = generator.module.infer_decode_chunk( 445 | decoder_inputs_shift, speakers) 446 | z_yin = z_yin * mask 447 | yin_hat = generator.module.yin_dec_infer(z_yin, mask, speakers) 448 | 449 | y_hat_mel_length = mask.sum([1, 2]).long() 450 | y_hat_lengths = y_hat_mel_length * hps.data.hop_length 451 | 452 | mel = spec_to_mel_torch(spec, hps.data.filter_length, 453 | hps.data.n_mel_channels, 454 | hps.data.sampling_rate, 455 | hps.data.mel_fmin, hps.data.mel_fmax) 456 | y_hat_mel = mel_spectrogram_torch( 457 | y_hat.squeeze(1).float(), hps.data.filter_length, 458 | hps.data.n_mel_channels, hps.data.sampling_rate, 459 | hps.data.hop_length, hps.data.win_length, 460 | hps.data.mel_fmin, hps.data.mel_fmax) 461 | y_hat_shift_mel = mel_spectrogram_torch( 462 | y_hat_shift.squeeze(1).float(), hps.data.filter_length, 463 | hps.data.n_mel_channels, hps.data.sampling_rate, 464 | hps.data.hop_length, hps.data.win_length, 
465 | hps.data.mel_fmin, hps.data.mel_fmax) 466 | y_hat_pad = F.pad( 467 | y_hat, (hps.data.filter_length - hps.data.hop_length, 468 | hps.data.filter_length - hps.data.hop_length + 469 | (-y_hat.shape[-1]) % hps.data.hop_length + 470 | hps.data.hop_length * 471 | (y_hat.shape[-1] % hps.data.hop_length == 0)), 472 | mode='reflect').squeeze(1) 473 | y_hat_shift_pad = F.pad( 474 | y_hat_shift, 475 | (hps.data.filter_length - hps.data.hop_length, 476 | hps.data.filter_length - hps.data.hop_length + 477 | (-y_hat.shape[-1]) % hps.data.hop_length + 478 | hps.data.hop_length * 479 | (y_hat.shape[-1] % hps.data.hop_length == 0)), 480 | mode='reflect').squeeze(1) 481 | ying_hat = generator.module.pitch.yingram(y_hat_pad) 482 | ying_hat_shift = generator.module.pitch.yingram( 483 | y_hat_shift_pad) 484 | 485 | if y_hat_mel.size(2) < mel.size(2): 486 | zero = torch.full((n_sample, y_hat_mel.size(1), 487 | mel.size(2) - y_hat_mel.size(2)), 488 | -11.5129).to(y_hat_mel.device) 489 | y_hat_mel = torch.cat((y_hat_mel, zero), dim=2) 490 | y_hat_shift_mel = torch.cat((y_hat_shift_mel, zero), dim=2) 491 | zero = torch.full((n_sample, yin_hat.size(1), 492 | mel.size(2) - yin_hat.size(2)), 493 | 0).to(y_hat_mel.device) 494 | yin_hat = torch.cat((yin_hat, zero), dim=2) 495 | zero = torch.full((n_sample, ying_hat.size(1), 496 | mel.size(2) - ying_hat.size(2)), 497 | 0).to(y_hat_mel.device) 498 | ying_hat = torch.cat((ying_hat, zero), dim=2) 499 | ying_hat_shift = torch.cat((ying_hat_shift, zero), dim=2) 500 | zero = torch.full( 501 | (n_sample, z_yin.size(1), mel.size(2) - z_yin.size(2)), 502 | 0).to(y_hat_mel.device) 503 | z_yin = torch.cat((z_yin, zero), dim=2) 504 | 505 | ids = torch.arange(0, mel.size(2)).unsqueeze(0).expand( 506 | mel.size(1), 507 | -1).unsqueeze(0).expand(n_sample, -1, 508 | -1).to(y_hat_mel_length.device) 509 | mask = ids > y_hat_mel_length.unsqueeze(1).expand( 510 | -1, mel.size(1)).unsqueeze(2).expand( 511 | -1, -1, mel.size(2)) 512 | y_hat_mel.masked_fill_(mask, -11.5129) 513 | y_hat_shift_mel.masked_fill_(mask, -11.5129) 514 | 515 | image_dict = dict() 516 | audio_dict = dict() 517 | for i in range(n_sample): 518 | image_dict.update({ 519 | "gen/{}_mel".format(i): 520 | utils.plot_spectrogram_to_numpy( 521 | y_hat_mel[i].cpu().numpy()) 522 | }) 523 | audio_dict.update({ 524 | "gen/{}_audio".format(i): 525 | y_hat[i, :, :y_hat_lengths[i]] 526 | }) 527 | image_dict.update({ 528 | "gen/{}_mel_shift".format(i): 529 | utils.plot_spectrogram_to_numpy( 530 | y_hat_shift_mel[i].cpu().numpy()) 531 | }) 532 | audio_dict.update({ 533 | "gen/{}_audio_shift".format(i): 534 | y_hat_shift[i, :, :y_hat_lengths[i]] 535 | }) 536 | image_dict.update({ 537 | "gen/{}_z_yin".format(i): 538 | utils.plot_spectrogram_to_numpy(z_yin[i].cpu().numpy()) 539 | }) 540 | image_dict.update({ 541 | "gen/{}_yin_dec".format(i): 542 | utils.plot_spectrogram_to_numpy( 543 | yin_hat[i].cpu().numpy()) 544 | }) 545 | image_dict.update({ 546 | "gen/{}_ying".format(i): 547 | utils.plot_spectrogram_to_numpy( 548 | ying_hat[i].cpu().numpy()) 549 | }) 550 | image_dict.update({ 551 | "gen/{}_ying_shift".format(i): 552 | utils.plot_spectrogram_to_numpy( 553 | ying_hat_shift[i].cpu().numpy()) 554 | }) 555 | 556 | if current_step == 0: 557 | for i in range(n_sample): 558 | image_dict.update({ 559 | "gt/{}_mel".format(i): 560 | utils.plot_spectrogram_to_numpy( 561 | mel[i].cpu().numpy()) 562 | }) 563 | image_dict.update({ 564 | "gt/{}_ying".format(i): 565 | utils.plot_spectrogram_to_numpy( 566 | ying[i].cpu().numpy()) 567 | }) 568 | 
audio_dict.update( 569 | {"gt/{}_audio".format(i): y[i, :, :y_lengths[i]]}) 570 | 571 | utils.summarize(writer=writer, 572 | global_step=epoch, 573 | images=image_dict, 574 | audios=audio_dict, 575 | audio_sampling_rate=hps.data.sampling_rate) 576 | val_bar.update(1) 577 | loss_val_mel = loss_val_mel / len(eval_loader) 578 | loss_val_yin = loss_val_yin / len(eval_loader) 579 | 580 | scalar_dict = { 581 | "loss/val/mel": loss_val_mel, 582 | "loss/val/yin": loss_val_yin, 583 | } 584 | utils.summarize(writer=writer, 585 | global_step=current_step, 586 | scalars=scalar_dict) 587 | generator.train() 588 | 589 | 590 | if __name__ == "__main__": 591 | 592 | ### add ### 593 | # model_name = input("Enter your experimental model name...") 594 | ########### 595 | 596 | parser = argparse.ArgumentParser() 597 | parser.add_argument('-c', 598 | '--config', 599 | type=str, 600 | #default="./configs/default.yaml", 601 | default="./configs/config_ja_44100.yaml", ### add ### 602 | help='Path to configuration file') 603 | parser.add_argument('-m', 604 | '--model', 605 | type=str, 606 | default="test", 607 | #required=True, 608 | help='Model name') 609 | parser.add_argument('-r', 610 | '--resume', 611 | type=str, 612 | help='Path to checkpoint for resume') 613 | parser.add_argument('-f', 614 | '--force_resume', 615 | type=str, 616 | help='Path to checkpoint for force resume') 617 | parser.add_argument('-t', 618 | '--transfer', 619 | type=str, 620 | help='Path to baseline checkpoint for transfer') 621 | parser.add_argument('-w', 622 | '--ignore_warning', 623 | default=True, 624 | action="store_true", 625 | help='Ignore warning message') 626 | parser.add_argument('-i', 627 | '--initial_run', 628 | action="store_true", 629 | help='Inintial run for saving pt files') 630 | args = parser.parse_args() 631 | if args.ignore_warning: 632 | import warnings 633 | warnings.filterwarnings(action='ignore') 634 | 635 | main(args) 636 | -------------------------------------------------------------------------------- /transforms.py: -------------------------------------------------------------------------------- 1 | # from https://github.com/jaywalnut310/vits 2 | import numpy as np 3 | import torch 4 | from torch.nn import functional as F 5 | 6 | 7 | DEFAULT_MIN_BIN_WIDTH = 1e-3 8 | DEFAULT_MIN_BIN_HEIGHT = 1e-3 9 | DEFAULT_MIN_DERIVATIVE = 1e-3 10 | 11 | 12 | def piecewise_rational_quadratic_transform( 13 | inputs, 14 | unnormalized_widths, 15 | unnormalized_heights, 16 | unnormalized_derivatives, 17 | inverse=False, 18 | tails=None, 19 | tail_bound=1., 20 | min_bin_width=DEFAULT_MIN_BIN_WIDTH, 21 | min_bin_height=DEFAULT_MIN_BIN_HEIGHT, 22 | min_derivative=DEFAULT_MIN_DERIVATIVE 23 | ): 24 | 25 | if tails is None: 26 | spline_fn = rational_quadratic_spline 27 | spline_kwargs = {} 28 | else: 29 | spline_fn = unconstrained_rational_quadratic_spline 30 | spline_kwargs = { 31 | 'tails': tails, 32 | 'tail_bound': tail_bound 33 | } 34 | 35 | outputs, logabsdet = spline_fn( 36 | inputs=inputs, 37 | unnormalized_widths=unnormalized_widths, 38 | unnormalized_heights=unnormalized_heights, 39 | unnormalized_derivatives=unnormalized_derivatives, 40 | inverse=inverse, 41 | min_bin_width=min_bin_width, 42 | min_bin_height=min_bin_height, 43 | min_derivative=min_derivative, 44 | **spline_kwargs 45 | ) 46 | return outputs, logabsdet 47 | 48 | 49 | def searchsorted(bin_locations, inputs, eps=1e-6): 50 | bin_locations[..., -1] += eps 51 | return torch.sum( 52 | inputs[..., None] >= bin_locations, 53 | dim=-1 54 | ) - 1 55 | 56 | 57 | def 
unconstrained_rational_quadratic_spline( 58 | inputs, 59 | unnormalized_widths, 60 | unnormalized_heights, 61 | unnormalized_derivatives, 62 | inverse=False, 63 | tails='linear', 64 | tail_bound=1., 65 | min_bin_width=DEFAULT_MIN_BIN_WIDTH, 66 | min_bin_height=DEFAULT_MIN_BIN_HEIGHT, 67 | min_derivative=DEFAULT_MIN_DERIVATIVE 68 | ): 69 | inside_interval_mask = (inputs >= -tail_bound) & (inputs <= tail_bound) 70 | outside_interval_mask = ~inside_interval_mask 71 | 72 | outputs = torch.zeros_like(inputs) 73 | logabsdet = torch.zeros_like(inputs) 74 | 75 | if tails == 'linear': 76 | unnormalized_derivatives = F.pad(unnormalized_derivatives, pad=(1, 1)) 77 | constant = np.log(np.exp(1 - min_derivative) - 1) 78 | unnormalized_derivatives[..., 0] = constant 79 | unnormalized_derivatives[..., -1] = constant 80 | 81 | outputs[outside_interval_mask] = inputs[outside_interval_mask] 82 | logabsdet[outside_interval_mask] = 0 83 | else: 84 | raise RuntimeError('{} tails are not implemented.'.format(tails)) 85 | 86 | outputs[inside_interval_mask], logabsdet[inside_interval_mask] = rational_quadratic_spline( 87 | inputs=inputs[inside_interval_mask], 88 | unnormalized_widths=unnormalized_widths[inside_interval_mask, :], 89 | unnormalized_heights=unnormalized_heights[inside_interval_mask, :], 90 | unnormalized_derivatives=unnormalized_derivatives[inside_interval_mask, :], 91 | inverse=inverse, 92 | left=-tail_bound, right=tail_bound, bottom=-tail_bound, top=tail_bound, 93 | min_bin_width=min_bin_width, 94 | min_bin_height=min_bin_height, 95 | min_derivative=min_derivative 96 | ) 97 | 98 | return outputs, logabsdet 99 | 100 | def rational_quadratic_spline( 101 | inputs, 102 | unnormalized_widths, 103 | unnormalized_heights, 104 | unnormalized_derivatives, 105 | inverse=False, 106 | left=0., right=1., bottom=0., top=1., 107 | min_bin_width=DEFAULT_MIN_BIN_WIDTH, 108 | min_bin_height=DEFAULT_MIN_BIN_HEIGHT, 109 | min_derivative=DEFAULT_MIN_DERIVATIVE 110 | ): 111 | if torch.min(inputs) < left or torch.max(inputs) > right: 112 | raise ValueError('Input to a transform is not within its domain') 113 | 114 | num_bins = unnormalized_widths.shape[-1] 115 | 116 | if min_bin_width * num_bins > 1.0: 117 | raise ValueError('Minimal bin width too large for the number of bins') 118 | if min_bin_height * num_bins > 1.0: 119 | raise ValueError('Minimal bin height too large for the number of bins') 120 | 121 | widths = F.softmax(unnormalized_widths, dim=-1) 122 | widths = min_bin_width + (1 - min_bin_width * num_bins) * widths 123 | cumwidths = torch.cumsum(widths, dim=-1) 124 | cumwidths = F.pad(cumwidths, pad=(1, 0), mode='constant', value=0.0) 125 | cumwidths = (right - left) * cumwidths + left 126 | cumwidths[..., 0] = left 127 | cumwidths[..., -1] = right 128 | widths = cumwidths[..., 1:] - cumwidths[..., :-1] 129 | 130 | derivatives = min_derivative + F.softplus(unnormalized_derivatives) 131 | 132 | heights = F.softmax(unnormalized_heights, dim=-1) 133 | heights = min_bin_height + (1 - min_bin_height * num_bins) * heights 134 | cumheights = torch.cumsum(heights, dim=-1) 135 | cumheights = F.pad(cumheights, pad=(1, 0), mode='constant', value=0.0) 136 | cumheights = (top - bottom) * cumheights + bottom 137 | cumheights[..., 0] = bottom 138 | cumheights[..., -1] = top 139 | heights = cumheights[..., 1:] - cumheights[..., :-1] 140 | 141 | if inverse: 142 | bin_idx = searchsorted(cumheights, inputs)[..., None] 143 | else: 144 | bin_idx = searchsorted(cumwidths, inputs)[..., None] 145 | 146 | input_cumwidths = 
cumwidths.gather(-1, bin_idx)[..., 0] 147 | input_bin_widths = widths.gather(-1, bin_idx)[..., 0] 148 | 149 | input_cumheights = cumheights.gather(-1, bin_idx)[..., 0] 150 | delta = heights / widths 151 | input_delta = delta.gather(-1, bin_idx)[..., 0] 152 | 153 | input_derivatives = derivatives.gather(-1, bin_idx)[..., 0] 154 | input_derivatives_plus_one = derivatives[..., 1:].gather(-1, bin_idx)[..., 0] 155 | 156 | input_heights = heights.gather(-1, bin_idx)[..., 0] 157 | 158 | if inverse: 159 | a = (((inputs - input_cumheights) * (input_derivatives 160 | + input_derivatives_plus_one 161 | - 2 * input_delta) 162 | + input_heights * (input_delta - input_derivatives))) 163 | b = (input_heights * input_derivatives 164 | - (inputs - input_cumheights) * (input_derivatives 165 | + input_derivatives_plus_one 166 | - 2 * input_delta)) 167 | c = - input_delta * (inputs - input_cumheights) 168 | 169 | discriminant = b.pow(2) - 4 * a * c 170 | assert (discriminant >= 0).all() 171 | 172 | root = (2 * c) / (-b - torch.sqrt(discriminant)) 173 | outputs = root * input_bin_widths + input_cumwidths 174 | 175 | theta_one_minus_theta = root * (1 - root) 176 | denominator = input_delta + ((input_derivatives + input_derivatives_plus_one - 2 * input_delta) 177 | * theta_one_minus_theta) 178 | derivative_numerator = input_delta.pow(2) * (input_derivatives_plus_one * root.pow(2) 179 | + 2 * input_delta * theta_one_minus_theta 180 | + input_derivatives * (1 - root).pow(2)) 181 | logabsdet = torch.log(derivative_numerator) - 2 * torch.log(denominator) 182 | 183 | return outputs, -logabsdet 184 | else: 185 | theta = (inputs - input_cumwidths) / input_bin_widths 186 | theta_one_minus_theta = theta * (1 - theta) 187 | 188 | numerator = input_heights * (input_delta * theta.pow(2) 189 | + input_derivatives * theta_one_minus_theta) 190 | denominator = input_delta + ((input_derivatives + input_derivatives_plus_one - 2 * input_delta) 191 | * theta_one_minus_theta) 192 | outputs = input_cumheights + numerator / denominator 193 | 194 | derivative_numerator = input_delta.pow(2) * (input_derivatives_plus_one * theta.pow(2) 195 | + 2 * input_delta * theta_one_minus_theta 196 | + input_derivatives * (1 - theta).pow(2)) 197 | logabsdet = torch.log(derivative_numerator) - 2 * torch.log(denominator) 198 | 199 | return outputs, logabsdet 200 | -------------------------------------------------------------------------------- /utils.py: -------------------------------------------------------------------------------- 1 | # from https://github.com/jaywalnut310/vits 2 | import os 3 | import sys 4 | import logging 5 | import subprocess 6 | import torch 7 | import numpy as np 8 | from omegaconf import OmegaConf 9 | from scipy.io.wavfile import read 10 | 11 | MATPLOTLIB_FLAG = False 12 | 13 | logging.basicConfig( 14 | stream=sys.stdout, 15 | level=logging.INFO, 16 | format='[%(levelname)s|%(filename)s:%(lineno)s][%(asctime)s] >>> %(message)s' 17 | ) 18 | logger = logging 19 | 20 | 21 | def load_checkpoint(checkpoint_path, rank=0, model_g=None, model_d=None, optim_g=None, optim_d=None): 22 | assert os.path.isfile(checkpoint_path) 23 | checkpoint_dict = torch.load(checkpoint_path, map_location='cpu') 24 | iteration = checkpoint_dict['iteration'] 25 | learning_rate = checkpoint_dict['learning_rate'] 26 | config = checkpoint_dict['config'] 27 | 28 | if model_g is not None: 29 | model_g, optim_g = load_model( 30 | model_g, 31 | checkpoint_dict['model_g'], 32 | optim_g, 33 | checkpoint_dict['optimizer_g']) 34 | 35 | if model_d is not None: 36 | 
model_d, optim_d = load_model( 37 | model_d, 38 | checkpoint_dict['model_d'], 39 | optim_d, 40 | checkpoint_dict['optimizer_d']) 41 | if rank == 0: 42 | logger.info( 43 | "Loaded checkpoint '{}' (iteration {})".format( 44 | checkpoint_path, 45 | iteration 46 | ) 47 | ) 48 | return model_g, model_d, optim_g, optim_d, learning_rate, iteration, config 49 | 50 | def load_checkpoint_diffsize(checkpoint_path, rank=0, model_g=None, model_d=None): 51 | assert os.path.isfile(checkpoint_path) 52 | checkpoint_dict = torch.load(checkpoint_path, map_location='cpu') 53 | iteration = checkpoint_dict['iteration'] 54 | learning_rate = checkpoint_dict['learning_rate'] 55 | config = checkpoint_dict['config'] 56 | 57 | if model_g is not None: 58 | model_g = load_model_diffsize( 59 | model_g, 60 | checkpoint_dict['model_g']) 61 | if model_d is not None: 62 | model_d = load_model_diffsize( 63 | model_d, 64 | checkpoint_dict['model_d']) 65 | if rank == 0: 66 | logger.info( 67 | "Loaded checkpoint '{}' (iteration {})".format( 68 | checkpoint_path, 69 | iteration 70 | ) 71 | ) 72 | del checkpoint_dict 73 | return model_g, model_d, learning_rate, iteration, config 74 | 75 | def load_model_diffsize(model, model_state_dict): 76 | if hasattr(model, 'module'): 77 | state_dict = model.module.state_dict() 78 | else: 79 | state_dict = model.state_dict() 80 | 81 | for k, v in model_state_dict.items(): 82 | if k in state_dict and state_dict[k].size() == v.size(): 83 | state_dict[k] = v 84 | 85 | if hasattr(model, 'module'): 86 | model.module.load_state_dict(state_dict, strict=False) 87 | else: 88 | model.load_state_dict(state_dict, strict=False) 89 | 90 | return model 91 | 92 | 93 | 94 | def load_model(model, model_state_dict, optim, optim_state_dict): 95 | if optim is not None: 96 | optim.load_state_dict(optim_state_dict) 97 | 98 | if hasattr(model, 'module'): 99 | state_dict = model.module.state_dict() 100 | else: 101 | state_dict = model.state_dict() 102 | 103 | for k, v in model_state_dict.items(): 104 | if k in state_dict and state_dict[k].size() == v.size(): 105 | state_dict[k] = v 106 | 107 | if hasattr(model, 'module'): 108 | model.module.load_state_dict(state_dict) 109 | else: 110 | model.load_state_dict(state_dict) 111 | 112 | return model, optim 113 | 114 | 115 | def save_checkpoint(net_g, optim_g, net_d, optim_d, hps, epoch, learning_rate, save_path): 116 | 117 | def get_state_dict(model): 118 | if hasattr(model, 'module'): 119 | state_dict = model.module.state_dict() 120 | else: 121 | state_dict = model.state_dict() 122 | return state_dict 123 | 124 | torch.save({'model_g': get_state_dict(net_g), 125 | 'model_d': get_state_dict(net_d), 126 | 'optimizer_g': optim_g.state_dict(), 127 | 'optimizer_d': optim_d.state_dict(), 128 | 'config': str(hps), 129 | 'iteration': epoch, 130 | 'learning_rate': learning_rate}, save_path) 131 | 132 | 133 | def summarize(writer, global_step, scalars={}, histograms={}, images={}, audios={}, audio_sampling_rate=22050): 134 | for k, v in scalars.items(): 135 | writer.add_scalar(k, v, global_step) 136 | for k, v in histograms.items(): 137 | writer.add_histogram(k, v, global_step) 138 | for k, v in images.items(): 139 | writer.add_image(k, v, global_step, dataformats='HWC') 140 | for k, v in audios.items(): 141 | writer.add_audio(k, v, global_step, audio_sampling_rate) 142 | 143 | 144 | def plot_spectrogram_to_numpy(spectrogram): 145 | global MATPLOTLIB_FLAG 146 | if not MATPLOTLIB_FLAG: 147 | import matplotlib 148 | matplotlib.use("Agg") 149 | MATPLOTLIB_FLAG = True 150 | mpl_logger = 
logging.getLogger('matplotlib') 151 | mpl_logger.setLevel(logging.WARNING) 152 | import matplotlib.pylab as plt 153 | import numpy as np 154 | 155 | fig, ax = plt.subplots(figsize=(10, 2)) 156 | im = ax.imshow(spectrogram, aspect="auto", origin="lower", 157 | interpolation='none') 158 | plt.colorbar(im, ax=ax) 159 | plt.xlabel("Frames") 160 | plt.ylabel("Channels") 161 | plt.tight_layout() 162 | 163 | fig.canvas.draw() 164 | data = np.fromstring(fig.canvas.tostring_rgb(), dtype=np.uint8, sep='') 165 | data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,)) 166 | plt.close() 167 | return data 168 | 169 | 170 | def plot_alignment_to_numpy(alignment, info=None): 171 | global MATPLOTLIB_FLAG 172 | if not MATPLOTLIB_FLAG: 173 | import matplotlib 174 | matplotlib.use("Agg") 175 | MATPLOTLIB_FLAG = True 176 | mpl_logger = logging.getLogger('matplotlib') 177 | mpl_logger.setLevel(logging.WARNING) 178 | import matplotlib.pylab as plt 179 | import numpy as np 180 | 181 | fig, ax = plt.subplots(figsize=(6, 4)) 182 | im = ax.imshow(alignment.transpose(), aspect='auto', origin='lower', 183 | interpolation='none') 184 | fig.colorbar(im, ax=ax) 185 | xlabel = 'Decoder timestep' 186 | if info is not None: 187 | xlabel += '\n\n' + info 188 | plt.xlabel(xlabel) 189 | plt.ylabel('Encoder timestep') 190 | plt.tight_layout() 191 | 192 | fig.canvas.draw() 193 | data = np.fromstring(fig.canvas.tostring_rgb(), dtype=np.uint8, sep='') 194 | data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,)) 195 | plt.close() 196 | return data 197 | 198 | 199 | def load_wav_to_torch(full_path): 200 | sampling_rate, wav = read(full_path.replace("\\", "/")) ### modify .replace("\\", "/") ### 201 | 202 | if len(wav.shape) == 2: 203 | wav = wav[:, 0] 204 | 205 | if wav.dtype == np.int16: 206 | wav = wav / 32768.0 207 | elif wav.dtype == np.int32: 208 | wav = wav / 2147483648.0 209 | elif wav.dtype == np.uint8: 210 | wav = (wav - 128) / 128.0 211 | wav = wav.astype(np.float32) 212 | return torch.FloatTensor(wav), sampling_rate 213 | 214 | 215 | def load_filepaths_and_text(filename, split="|"): 216 | with open(filename, encoding='utf-8') as f: 217 | filepaths_and_text = [line.strip().split(split) for line in f] 218 | return filepaths_and_text 219 | 220 | 221 | def get_hparams(args, init=True): 222 | config = OmegaConf.load(args.config) 223 | hparams = HParams(**config) 224 | model_dir = "." + os.path.join(hparams.train.log_path, args.model) ### add "." + ### 225 | 226 | if not os.path.exists(model_dir): 227 | os.makedirs(model_dir) 228 | hparams.model_name = args.model 229 | hparams.model_dir = model_dir 230 | config_save_path = os.path.join(model_dir, "config.yaml") 231 | 232 | if init: 233 | OmegaConf.save(config, config_save_path) 234 | 235 | return hparams 236 | 237 | 238 | def get_hparams_from_file(config_path): 239 | config = OmegaConf.load(config_path) 240 | hparams = HParams(**config) 241 | return hparams 242 | 243 | 244 | def check_git_hash(model_dir): 245 | source_dir = os.path.dirname(os.path.realpath(__file__)) 246 | if not os.path.exists(os.path.join(source_dir, ".git")): 247 | logger.warn("{} is not a git repository, therefore hash value comparison will be ignored.".format( 248 | source_dir 249 | )) 250 | return 251 | 252 | cur_hash = subprocess.getoutput("git rev-parse HEAD") 253 | 254 | path = os.path.join(model_dir, "githash") 255 | if os.path.exists(path): 256 | saved_hash = open(path).read() 257 | if saved_hash != cur_hash: 258 | logger.warn("git hash values are different. 
{}(saved) != {}(current)".format( 259 | saved_hash[:8], cur_hash[:8])) 260 | else: 261 | open(path, "w").write(cur_hash) 262 | 263 | 264 | def get_logger(model_dir, filename="train.log"): 265 | global logger 266 | logger = logging.getLogger(os.path.basename(model_dir)) 267 | logger.setLevel(logging.DEBUG) 268 | 269 | formatter = logging.Formatter( 270 | "%(asctime)s\t%(name)s\t%(levelname)s\t%(message)s") 271 | if not os.path.exists(model_dir): 272 | os.makedirs(model_dir) 273 | h = logging.FileHandler(os.path.join(model_dir, filename)) 274 | h.setLevel(logging.DEBUG) 275 | h.setFormatter(formatter) 276 | logger.addHandler(h) 277 | return logger 278 | 279 | 280 | class HParams(): 281 | def __init__(self, **kwargs): 282 | for k, v in kwargs.items(): 283 | if type(v) == dict: 284 | v = HParams(**v) 285 | self[k] = v 286 | 287 | def keys(self): 288 | return self.__dict__.keys() 289 | 290 | def items(self): 291 | return self.__dict__.items() 292 | 293 | def values(self): 294 | return self.__dict__.values() 295 | 296 | def __len__(self): 297 | return len(self.__dict__) 298 | 299 | def __getitem__(self, key): 300 | return getattr(self, key) 301 | 302 | def __setitem__(self, key, value): 303 | return setattr(self, key, value) 304 | 305 | def __contains__(self, key): 306 | return key in self.__dict__ 307 | 308 | def __repr__(self): 309 | return self.__dict__.__repr__() 310 | -------------------------------------------------------------------------------- /yin.py: -------------------------------------------------------------------------------- 1 | # remove np from https://github.com/dhchoi99/NANSY/blob/master/models/yin.py 2 | # adapted from https://github.com/patriceguyot/Yin 3 | # https://github.com/NVIDIA/mellotron/blob/master/yin.py 4 | 5 | import torch 6 | import torch.nn.functional as F 7 | from math import log2, ceil 8 | 9 | 10 | def differenceFunction(x, N, tau_max): 11 | """ 12 | Compute difference function of data x. This corresponds to equation (6) in [1] 13 | This solution is implemented directly with torch rfft. 14 | 15 | 16 | :param x: audio data (Tensor) 17 | :param N: length of data 18 | :param tau_max: integration window size 19 | :return: difference function 20 | :rtype: list 21 | """ 22 | 23 | #x = np.array(x, np.float64) #[B,T] 24 | assert x.dim() == 2 25 | b, w = x.shape 26 | if w < tau_max: 27 | x = F.pad(x, (tau_max - w - (tau_max - w) // 2, (tau_max - w) // 2), 28 | 'constant', 29 | mode='reflect') 30 | w = tau_max 31 | #x_cumsum = np.concatenate((np.array([0.]), (x * x).cumsum())) 32 | x_cumsum = torch.cat( 33 | [torch.zeros([b, 1], device=x.device), (x * x).cumsum(dim=1)], dim=1) 34 | size = w + tau_max 35 | p2 = (size // 32).bit_length() 36 | #p2 = ceil(log2(size+1 // 32)) 37 | nice_numbers = (16, 18, 20, 24, 25, 27, 30, 32) 38 | size_pad = min(n * 2**p2 for n in nice_numbers if n * 2**p2 >= size) 39 | fc = torch.fft.rfft(x, size_pad) #[B,F] 40 | conv = torch.fft.irfft(fc * fc.conj())[:, :tau_max] 41 | return x_cumsum[:, w:w - tau_max: 42 | -1] + x_cumsum[:, w] - x_cumsum[:, :tau_max] - 2 * conv 43 | 44 | 45 | def differenceFunction_np(x, N, tau_max): 46 | """ 47 | Compute difference function of data x. This corresponds to equation (6) in [1] 48 | This solution is implemented directly with Numpy fft. 
49 | 50 | 51 | :param x: audio data 52 | :param N: length of data 53 | :param tau_max: integration window size 54 | :return: difference function 55 | :rtype: list 56 | """ 57 | 58 | x = np.array(x, np.float64) 59 | w = x.size 60 | tau_max = min(tau_max, w) 61 | x_cumsum = np.concatenate((np.array([0.]), (x * x).cumsum())) 62 | size = w + tau_max 63 | p2 = (size // 32).bit_length() 64 | nice_numbers = (16, 18, 20, 24, 25, 27, 30, 32) 65 | size_pad = min(x * 2**p2 for x in nice_numbers if x * 2**p2 >= size) 66 | fc = np.fft.rfft(x, size_pad) 67 | conv = np.fft.irfft(fc * fc.conjugate())[:tau_max] 68 | return x_cumsum[w:w - 69 | tau_max:-1] + x_cumsum[w] - x_cumsum[:tau_max] - 2 * conv 70 | 71 | 72 | def cumulativeMeanNormalizedDifferenceFunction(df, N, eps=1e-8): 73 | """ 74 | Compute cumulative mean normalized difference function (CMND). 75 | 76 | This corresponds to equation (8) in [1] 77 | 78 | :param df: Difference function 79 | :param N: length of data 80 | :return: cumulative mean normalized difference function 81 | :rtype: list 82 | """ 83 | #np.seterr(divide='ignore', invalid='ignore') 84 | # scipy method, assert df>0 for all element 85 | # cmndf = df[1:] * np.asarray(list(range(1, N))) / (np.cumsum(df[1:]).astype(float) + eps) 86 | B, _ = df.shape 87 | cmndf = df[:, 88 | 1:] * torch.arange(1, N, device=df.device, dtype=df.dtype).view( 89 | 1, -1) / (df[:, 1:].cumsum(dim=-1) + eps) 90 | return torch.cat( 91 | [torch.ones([B, 1], device=df.device, dtype=df.dtype), cmndf], dim=-1) 92 | 93 | 94 | def differenceFunctionTorch(xs: torch.Tensor, N, tau_max) -> torch.Tensor: 95 | """pytorch backend batch-wise differenceFunction 96 | has 1e-4 level error with input shape of (32, 22050*1.5) 97 | Args: 98 | xs: 99 | N: 100 | tau_max: 101 | 102 | Returns: 103 | 104 | """ 105 | xs = xs.double() 106 | w = xs.shape[-1] 107 | tau_max = min(tau_max, w) 108 | zeros = torch.zeros((xs.shape[0], 1)) 109 | x_cumsum = torch.cat((torch.zeros((xs.shape[0], 1), device=xs.device), 110 | (xs * xs).cumsum(dim=-1, dtype=torch.double)), 111 | dim=-1) # B x w 112 | size = w + tau_max 113 | p2 = (size // 32).bit_length() 114 | nice_numbers = (16, 18, 20, 24, 25, 27, 30, 32) 115 | size_pad = min(x * 2**p2 for x in nice_numbers if x * 2**p2 >= size) 116 | 117 | fcs = torch.fft.rfft(xs, n=size_pad, dim=-1) 118 | convs = torch.fft.irfft(fcs * fcs.conj())[:, :tau_max] 119 | y1 = torch.flip(x_cumsum[:, w - tau_max + 1:w + 1], dims=[-1]) 120 | y = y1 + x_cumsum[:, w].unsqueeze(-1) - x_cumsum[:, :tau_max] - 2 * convs 121 | return y 122 | 123 | 124 | def cumulativeMeanNormalizedDifferenceFunctionTorch(dfs: torch.Tensor, 125 | N, 126 | eps=1e-8) -> torch.Tensor: 127 | arange = torch.arange(1, N, device=dfs.device, dtype=torch.float64) 128 | cumsum = torch.cumsum(dfs[:, 1:], dim=-1, 129 | dtype=torch.float64).to(dfs.device) 130 | 131 | cmndfs = dfs[:, 1:] * arange / (cumsum + eps) 132 | cmndfs = torch.cat( 133 | (torch.ones(cmndfs.shape[0], 1, device=dfs.device), cmndfs), dim=-1) 134 | return cmndfs 135 | 136 | 137 | if __name__ == '__main__': 138 | wav = torch.randn(32, int(22050 * 1.5)).cuda() 139 | wav_numpy = wav.detach().cpu().numpy() 140 | x = wav_numpy[0] 141 | 142 | w_len = 2048 143 | w_step = 256 144 | tau_max = 2048 145 | W = 2048 146 | 147 | startFrames = list(range(0, x.shape[-1] - w_len, w_step)) 148 | startFrames = np.asarray(startFrames) 149 | # times = startFrames / sr 150 | frames = [x[..., t:t + W] for t in startFrames] 151 | frames = np.asarray(frames) 152 | frames_torch = torch.from_numpy(frames).cuda() 153 
| 154 | cmndfs0 = [] 155 | for idx, frame in enumerate(frames): 156 | df = differenceFunction(frame, frame.shape[-1], tau_max) 157 | cmndf = cumulativeMeanNormalizedDifferenceFunction(df, tau_max) 158 | cmndfs0.append(cmndf) 159 | cmndfs0 = np.asarray(cmndfs0) 160 | 161 | dfs = differenceFunctionTorch(frames_torch, frames_torch.shape[-1], 162 | tau_max) 163 | cmndfs1 = cumulativeMeanNormalizedDifferenceFunctionTorch( 164 | dfs, tau_max).detach().cpu().numpy() 165 | print(cmndfs0.shape, cmndfs1.shape) 166 | print(np.sum(np.abs(cmndfs0 - cmndfs1))) 167 | --------------------------------------------------------------------------------
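
As a usage sketch (not part of the original repository), the batched helpers defined in `yin.py` above can be exercised as follows. Note that `yin.py` itself no longer imports numpy (see the "remove np" comment at its top), so this sketch sticks to the torch-based functions; the frame length and `tau_max` values here are illustrative assumptions, not repository defaults, and the simple `argmin` at the end stands in for YIN's full absolute-threshold dip search.

```python
# Hedged usage sketch for the batched YIN helpers in yin.py.
# Frame length and tau_max below are illustrative assumptions.
import torch

from yin import (differenceFunctionTorch,
                 cumulativeMeanNormalizedDifferenceFunctionTorch)

# Fake batch of audio frames: (batch, frame_length)
frames = torch.randn(8, 2048)
tau_max = 2048

# Difference function d_t(tau), eq. (6) of the YIN paper, computed per frame
dfs = differenceFunctionTorch(frames, frames.shape[-1], tau_max)

# Cumulative mean normalized difference function d'_t(tau), eq. (8)
cmndfs = cumulativeMeanNormalizedDifferenceFunctionTorch(dfs, tau_max)

# Crude pitch-period candidate per frame: lag of the deepest dip (tau > 0)
tau_candidates = cmndfs[:, 1:].argmin(dim=-1) + 1
print(cmndfs.shape, tau_candidates)
```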