├── LICENSE
├── README.md
├── analysis.py
├── app.py
├── asset
    ├── Yingram.png
    └── overall.png
├── attentions.py
├── commons.py
├── configs
    ├── config_en.yaml
    ├── config_ja_22050.yaml
    └── config_ja_44100.yaml
├── data_utils.py
├── dataset
    └── preprocess.py
├── filelists
    ├── vctk_test_g2p.txt
    ├── vctk_train_g2p.txt
    └── vctk_val_g2p.txt
├── inference.py
├── losses.py
├── mel_processing.py
├── metadata_cleaners.py
├── models.py
├── modules.py
├── monotonic_align
    ├── __init__.py
    ├── core.pyx
    └── setup.py
├── pqmf.py
├── requirements.txt
├── text
    ├── __init__.py
    ├── cleaners.py
    ├── numbers.py
    └── symbols.py
├── train.py
├── transforms.py
├── utils.py
└── yin.py


/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2023 ㌧㌧
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # PITS(44100Hz 日本語対応版)
  2 | **PITS: Variational Pitch Inference without Fundamental Frequency for End-to-End Pitch-controllable TTS**
  3 | 
  4 | このリポジトリは、 44100Hzの日本語音声を学習および出力できるように編集した[PITS](https://github.com/anonymous-pits/pits)です。初期状態ではベクトル量子化処理無しのPITS(A+D)版ですが、models.pyのfor Q optionと記載されている部分のコードを数行変更すれば、PITS(A+D+Q)版へと変更が可能です。
  5 | 
  6 | ![overall](asset/overall.png) 
  7 | 
  8 | ## 1. 環境構築
  9 | Anacondaによる実行環境構築を想定します。
 10 | 
 11 | 1. Anacondaで"PITS"という名前の仮想環境を作成する。[y]or nを聞かれたら[y]を入力する。
 12 |     ```sh
 13 |     conda create -n PITS python=3.8     
 14 |     ```
 15 | 1. 仮想環境を有効化する。
 16 |     ```sh
 17 |     conda activate PITS 
 18 |     ```
 19 | 1. このリポジトリをクローンする（もしくはDownload Zipでダウンロードする）
 20 |     ```sh
 21 |     git clone https://github.com/tonnetonne814/PITS-44100-Ja.git
 22 |     cd PITS-44100-Ja # フォルダ移動
 23 |     ```
 24 | 1. [PyTorch.org](https://pytorch.org/)より、自分の環境に合わせてPyTorchをインストールする
 25 |     ```sh
 26 |     # OS=Linux, CUDA=11.7 の例
 27 |     pip3 install torch torchvision torchaudio 
 28 |     ```
 29 | 1. その他、必要なパッケージをインストールする。
 30 |     ```sh
 31 |     pip install -r requirements.txt 
 32 |     ```
 33 | 1. Monotonoic Alignment Searchをビルドする。
 34 |     ```sh
 35 |     cd monotonic_align
 36 |     mkdir monotonic_align
 37 |     python setup.py build_ext --inplace
 38 |     ```
 39 | ## 2. データセットの準備
 40 | [JVSコーパス](https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus)による、parallel100(話者間で共通する読み上げ音声 100 発話)、及びnonpara30(話者間で全く異なる読み上げ音声 30 発話)の学習を想定します。
 41 | 
 42 | 1. [こちら](https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus)からJVSコーパスをダウンロード&解凍する。
 43 | 1. 発話音声ファイルのサンプリングレートを44100Hzに変更する。path/to/〜となっている部分は適宜変更する。
 44 |     ```sh
 45 |     python3 ./dataset/preprocess.py --folder_path path/to/jvs_ver1/ --sampling_rate 44100
 46 |     ```
 47 |     
 48 |     > ⚠path/to/jvs_ver1/ には、jvsコーパスの各話者の発話フォルダ[jvs001,jvs002, ... ,jvs100]が格納されているフォルダパスを指定する。
 49 | 
 50 | ## 3. [configs](configs)フォルダ内のjsonを編集
 51 | 主要なパラメータを下表に記載します。
 52 | 
 53 | | 分類  | パラメータ名      | 説明                                                                                       |
 54 | |:-----:|:-----------------:|:------------------------------------------------------------------------------------------:|
 55 | | train | log_interval      | 指定ステップ毎にロスを算出し記録する                                                       |
 56 | | train | eval_interval     | 指定ステップ毎にモデル評価を行う                                                           |
 57 | | train | save_interval     | 指定ステップ毎にモデル保存を行う                                                           |
 58 | | train | epochs            | 学習データ全体の学習回数                                                                   |
 59 | | train | batch_size        | 一回のパラメータ更新に使用する学習データの数                                               |
 60 | | data  | data_path         | jvs話者フォルダが格納されているフォルダパス（preprocess.pyで使用したpath/to/jvs_ver1/の値）|
 61 | | data  | training_files    | 学習用filelistのテキストパス                                                               |
 62 | | data  | validation_files  | 検証用filelistのテキストパス                                                               |
 63 | | data  | speakers          | 話者名のリスト                                                               |
 64 | 
 65 | config_ja_44100.yaml内の、data部分のdata_pathの値を、「2. データセットの準備」部分のpreprocess.pyに使用したjvsフォルダパスに書き換えます。
 66 | 
 67 | ## 4. 学習
 68 | 44100HzでのPITS(A+D)版の学習を想定します。Terminalに以下を入力し、学習を開始する。path/to/〜となっている部分は適宜変更する。
 69 | ```sh
 70 | python3 train.py --config ./configs/config_ja_44100.yaml --model PITS_A+D 
 71 | # 途中から学習を開始する場合は、--resume path/to/checkpoint.pt を追加する 
 72 | ```
 73 | このとき、nonpara30について、書き起こし文(transcripts_utf8.txt)と実際に格納されているwavファイルが一致しないものは除外する処理が入っています。
 74 | 
 75 | 学習経過はターミナルにも表示されるが、tensorboardを用いて確認することで、生成音声の視聴や、スペクトログラムやYingramや各ロス遷移を目視で確認することができる。
 76 | ```sh
 77 | tensorboard --logdir ./logs/PITS_A+D/
 78 | ```
 79 | 
 80 | ## 5. 推論
 81 | 推論を行う場合は、Terminalに以下を入力する。path/to/〜となっている部分は適宜変更する。
 82 | ```sh
 83 | python3 inference.py --config path/to/config.yaml --model PITS_A+D --model_path path/to/checkpoint.pth
 84 | ```
 85 | Terminalにて、話者名や、読み上げテキスト、ピッチシフト数（整数）を入力することで、音声が生成さされます。音声は自動的に再生され、infer_logsフォルダ（存在しない場合は自動作成）に保存されます。
 86 | 
 87 | ## 6.ファインチューニング
 88 | 1. ファインチューニング用のfilelist等を作成する必要があります。
 89 | ./filelist/*.txtの中身を参考に作成してください。形式としては、以下のようになっています。
 90 |     ```sh
 91 |     話者フォルダからwavファイルまでのパス　|　発話テキスト　|　話者名
 92 |     ```
 93 | 2. config.yaml内のtraining_filesと、validation_filesのパスを、作成したリストに書き換えます。
 94 | 3. config.yaml内のspeakers部分に、話者名を記載（追加でも変更でも良い）する。
 95 | 5. Terminalに以下を入力し、ファインチューニングを実行する。path/to/〜となっている部分は適宜変更する。
 96 |     ```sh
 97 |     python3 train.py --config path/to/config.yaml --model PITS_A+D_finetune --force_resume path/to/checkpoint.pt
 98 |     ```
 99 | 
100 | ## 事前学習モデル
101 | JVSコーパスを22050Hz版は150Epoch、44100Hz版は250Epoch程度学習した程度のモデルです。喋ったりピッチを変更する程度には問題ないとは思いますが、**学習不足である**ことに注意して使用してください。
102 | 
103 | **ダウンロード** : 
104 | [PITS(A+D)22050Hz版](https://drive.google.com/file/d/18eOyh8yEqryYTTssA1yUetbs66SbtlrM/view?usp=share_link) [PITS(A+D)44100版](https://drive.google.com/file/d/1AfSUHXkZ20_i_zwN8-f6I122pJd3Zb32/view?usp=share_link)
105 | 
106 | ## 付録(Yingramの可視化)
107 | >Yingram, an acoustic feature inspired by [YIN algorithm [22]](http://audition.ens.fr/adc/pdf/2002_JASA_YIN.pdf) that captures pitch information including harmonics.Yingram is designed to address the limitations of extracting f0, which is not well-defined in some cases [[23]](https://arxiv.org/abs/1910.10235), and the Yingram-based model shows better preference than the f0-based model [[16]](https://arxiv.org/abs/2110.14513).
108 | >> DeepL : Yingramは、[YINアルゴリズム[22]](http://audition.ens.fr/adc/pdf/2002_JASA_YIN.pdf)にインスパイアされた音響特徴で、倍音を含むピッチ情報を捉えます。Yingramは、場合によってはうまく定義できないf0を抽出する限界に対処するために設計され[[23]](https://arxiv.org/abs/1910.10235)、Yingramベースのモデルはf0ベースのモデルよりも優れた選好性を示します[[16]](https://arxiv.org/abs/2110.14513)。
109 | 
110 | ![overall](asset/Yingram.png) 
111 | 
112 | ## 参考文献
113 | - Official PITS Implementation; https://github.com/anonymous-pits/pits
114 | - Official VITS Implementation: https://github.com/jaywalnut310/vits
115 | - NANSY Implementation from dhchoi99: https://github.com/dhchoi99/NANSY
116 | - Official Avocodo Implementation: https://github.com/ncsoft/avocodo
117 | - Official PhaseAug Implementation: https://github.com/mindslab-ai/phaseaug
118 | - Tacotron Implementation from keithito: https://github.com/keithito/tacotron
119 | - CSTR VCTK Corpus (version 0.92): https://datashare.ed.ac.uk/handle/10283/3443
120 | - G2P for demo, g2p\_en from Kyubyong: https://github.com/Kyubyong/g2p
121 | - ESPNet:end-to-end speech processing toolkit: https://github.com/espnet/espnet


--------------------------------------------------------------------------------
/analysis.py:
--------------------------------------------------------------------------------
  1 | # modified from https://github.com/dhchoi99/NANSY
  2 | # We have modified the implementation of dhchoi99 to be fully differentiable.
  3 | import math
  4 | import torch
  5 | from yin import *
  6 | 
  7 | 
  8 | class Pitch(torch.nn.Module):
  9 | 
 10 |     def __init__(
 11 |             self, 
 12 |             sr=22050, 
 13 |             w_step=256, 
 14 |             W=2048, 
 15 |             tau_max=2048, 
 16 |             midi_start=5, 
 17 |             midi_end=85, 
 18 |             octave_range=12):
 19 |         super(Pitch, self).__init__()
 20 |         self.sr = sr
 21 |         self.w_step = w_step
 22 |         self.W = W
 23 |         self.tau_max = tau_max
 24 |         self.unfold = torch.nn.Unfold((1, self.W),
 25 |                                       1,
 26 |                                       0,
 27 |                                       stride=(1, self.w_step))
 28 |         midis = list(range(midi_start, midi_end))
 29 |         self.len_midis = len(midis)
 30 |         c_ms = torch.tensor([self.midi_to_lag(m, octave_range) for m in midis])
 31 |         self.register_buffer('c_ms', c_ms)
 32 |         self.register_buffer('c_ms_ceil', torch.ceil(self.c_ms).long())
 33 |         self.register_buffer('c_ms_floor', torch.floor(self.c_ms).long())
 34 | 
 35 |     def midi_to_lag(self, m: int, octave_range: float = 12):
 36 |         """converts midi-to-lag, eq. (4)
 37 | 
 38 |         Args:
 39 |             m: midi
 40 |             sr: sample_rate
 41 |             octave_range:
 42 | 
 43 |         Returns:
 44 |             lag: time lag(tau, c(m)) calculated from midi, eq. (4)
 45 | 
 46 |         """
 47 |         f = 440 * math.pow(2, (m - 69) / octave_range)
 48 |         lag = self.sr / f
 49 |         return lag
 50 | 
 51 |     def yingram_from_cmndf(self, cmndfs: torch.Tensor) -> torch.Tensor:
 52 |         """ yingram calculator from cMNDFs(cumulative Mean Normalized Difference Functions)
 53 | 
 54 |         Args:
 55 |             cmndfs: torch.Tensor
 56 |                 calculated cumulative mean normalized difference function
 57 |                 for details, see models/yin.py or eq. (1) and (2)
 58 |             ms: list of midi(int)
 59 |             sr: sampling rate
 60 | 
 61 |         Returns:
 62 |             y:
 63 |                 calculated batch yingram
 64 | 
 65 | 
 66 |         """
 67 |         #c_ms = np.asarray([Pitch.midi_to_lag(m, sr) for m in ms])
 68 |         #c_ms = torch.from_numpy(c_ms).to(cmndfs.device)
 69 | 
 70 |         y = (cmndfs[:, self.c_ms_ceil] -
 71 |              cmndfs[:, self.c_ms_floor]) / (self.c_ms_ceil - self.c_ms_floor).unsqueeze(0) * (
 72 |                  self.c_ms - self.c_ms_floor).unsqueeze(0) + cmndfs[:, self.c_ms_floor]
 73 |         return y
 74 | 
 75 |     def yingram(self, x: torch.Tensor):
 76 |         """calculates yingram from raw audio (multi segment)
 77 | 
 78 |         Args:
 79 |             x: raw audio, torch.Tensor of shape (t)
 80 |             W: yingram Window Size
 81 |             tau_max:
 82 |             sr: sampling rate
 83 |             w_step: yingram bin step size
 84 | 
 85 |         Returns:
 86 |             yingram: yingram. torch.Tensor of shape (80 x t')
 87 | 
 88 |         """
 89 |         # x.shape: t -> B,T, B,T = x.shape
 90 |         B, T = x.shape
 91 |         w_len = self.W
 92 | 
 93 | 
 94 |         frames = self.unfold(x.view(B, 1, 1, T))
 95 |         frames = frames.permute(0, 2,
 96 |                                 1).contiguous().view(-1,
 97 |                                                      self.W)  #[B* frames, W]
 98 |         # If not using gpu, or torch not compatible, implemented numpy batch function is still fine
 99 |         dfs = differenceFunctionTorch(frames, frames.shape[-1], self.tau_max)
100 |         cmndfs = cumulativeMeanNormalizedDifferenceFunctionTorch(
101 |             dfs, self.tau_max)
102 |         yingram = self.yingram_from_cmndf(cmndfs)  #[B*frames,F]
103 |         yingram = yingram.view(B, -1, self.len_midis).permute(0, 2,
104 |                                                               1)  # [B,F,T]
105 |         return yingram
106 | 
107 |     def crop_scope(self, x, yin_start,
108 |                    scope_shift):  # x: tensor [B,C,T] #scope_shift: tensor [B]
109 |         return torch.stack([
110 |             x[i, yin_start + scope_shift[i]:yin_start + self.yin_scope +
111 |               scope_shift[i], :] for i in range(x.shape[0])
112 |         ],
113 |                            dim=0)
114 | 
115 | 
116 | if __name__ == '__main__':
117 |     import torch
118 |     import librosa as rosa
119 |     import matplotlib.pyplot as plt
120 |     wav = torch.tensor(rosa.load('LJ001-0002.wav', sr=22050,
121 |                                  mono=True)[0]).unsqueeze(0)
122 |     #    wav = torch.randn(1,40965)
123 | 
124 |     wav = torch.nn.functional.pad(wav, (0, (-wav.shape[1]) % 256))
125 |     #    wav = wav[#:,:8096]
126 |     print(wav.shape)
127 |     pitch = Pitch()
128 | 
129 |     with torch.no_grad():
130 |         ps = pitch.yingram(torch.nn.functional.pad(wav, (1024, 1024)))
131 |         ps = torch.nn.functional.pad(ps, (0, 0, 8, 8), mode='replicate')
132 |         print(ps.shape)
133 |         spec = torch.stft(wav, 1024, 256, return_complex=False)
134 |         print(spec.shape)
135 |         plt.subplot(2, 1, 1)
136 |         plt.pcolor(ps[0].numpy(), cmap='magma')
137 |         plt.colorbar()
138 |         plt.subplot(2, 1, 2)
139 |         plt.pcolor(ps[0][15:65, :].numpy(), cmap='magma')
140 |         plt.colorbar()
141 |         plt.show()
142 | 


--------------------------------------------------------------------------------
/app.py:
--------------------------------------------------------------------------------
  1 | import gradio as gr
  2 | import argparse
  3 | import torch
  4 | import commons
  5 | import utils
  6 | from models import (
  7 |     SynthesizerTrn, )
  8 | 
  9 | from text.symbols import symbol_len, lang_to_dict
 10 | 
 11 | # we use Kyubyong/g2p for demo instead of our internal g2p
 12 | # https://github.com/Kyubyong/g2p
 13 | from g2p_en import G2p
 14 | import re
 15 | 
 16 | _symbol_to_id = lang_to_dict("en_US")
 17 | 
 18 | class GradioApp:
 19 | 
 20 |     def __init__(self, args):
 21 |         self.hps = utils.get_hparams_from_file(args.config)
 22 |         self.device = "cpu"
 23 |         self.net_g = SynthesizerTrn(symbol_len(self.hps.data.languages),
 24 |                                     self.hps.data.filter_length // 2 + 1,
 25 |                                     self.hps.train.segment_size //
 26 |                                     self.hps.data.hop_length,
 27 |                                     midi_start=-5,
 28 |                                     midi_end=75,
 29 |                                     octave_range=24,
 30 |                                     n_speakers=len(self.hps.data.speakers),
 31 |                                     **self.hps.model).to(self.device)
 32 |         _ = self.net_g.eval()
 33 |         _ = utils.load_checkpoint(args.checkpoint_path, model_g=self.net_g)
 34 |         self.g2p = G2p()
 35 |         self.interface = self._gradio_interface()
 36 | 
 37 |     def get_phoneme(self, text):
 38 |         phones = [re.sub("[0-9]", "", p) for p in self.g2p(text)]
 39 |         tone = [0 for p in phones]
 40 |         if self.hps.data.add_blank:
 41 |             text_norm = [_symbol_to_id[symbol] for symbol in phones]
 42 |             text_norm = commons.intersperse(text_norm, 0)
 43 |             tone = commons.intersperse(tone, 0)
 44 |         else:
 45 |             text_norm = phones
 46 |         text_norm = torch.LongTensor(text_norm)
 47 |         tone = torch.LongTensor(tone)
 48 |         return text_norm, tone, phones
 49 |     
 50 |     def inference(self, text, speaker_id_val, seed, scope_shift, duration):
 51 |         seed = int(seed)
 52 |         scope_shift = int(scope_shift)
 53 |         torch.manual_seed(seed)
 54 |         text_norm, tone, phones = self.get_phoneme(text)
 55 |         x_tst = text_norm.to(self.device).unsqueeze(0)
 56 |         t_tst = tone.to(self.device).unsqueeze(0)
 57 |         x_tst_lengths = torch.LongTensor([text_norm.size(0)]).to(self.device)
 58 |         speaker_id = torch.LongTensor([speaker_id_val]).to(self.device)
 59 |         decoder_inputs,*_ = self.net_g.infer_pre_decoder(
 60 |                                            x_tst,
 61 |                                            t_tst,
 62 |                                            x_tst_lengths,
 63 |                                            sid=speaker_id,
 64 |                                            noise_scale=0.667,
 65 |                                            noise_scale_w=0.8,
 66 |                                            length_scale=duration,
 67 |                                            scope_shift=scope_shift)
 68 |         audio = self.net_g.infer_decode_chunk(
 69 |             decoder_inputs, sid=speaker_id)[0, 0].data.cpu().float().numpy()
 70 |         del decoder_inputs,  
 71 |         return phones, (self.hps.data.sampling_rate, audio)
 72 | 
 73 | 
 74 |     def _gradio_interface(self):
 75 |         title = "PITS Demo"
 76 |         self.inputs = [
 77 |             gr.Textbox(label="Text (150 words limitation)",
 78 |                        value="This is demo page.",
 79 |                        elem_id="tts-input"),
 80 |             gr.Dropdown(list(self.hps.data.speakers),
 81 |                         value="p225",
 82 |                         label="Speaker Identity",
 83 |                         type="index"),
 84 |             gr.Slider(0, 65536, value=0, step=1, label="random seed"),
 85 |             gr.Slider(-15, 15, value=0, step=1, label="scope-shift"),
 86 |             gr.Slider(0.5, 2., value=1., step=0.1,
 87 |                       label="duration multiplier"),
 88 |         ]
 89 |         self.outputs = [
 90 |             gr.Textbox(label="Phonemes"),
 91 |             gr.Audio(type="numpy", label="Output audio")
 92 |         ]
 93 |         description = "Welcome to the Gradio demo for PITS: Variational Pitch Inference without Fundamental Frequency for End-to-End Pitch-controllable TTS.\n In this demo, we utilize an open-source G2P library (g2p_en) with stress removing, instead of our internal G2P.\n You can fix the latent z by controlling random seed.\n You can shift the pitch scope, but please note that this is opposite to pitch-shift. In addition, it is cropped from fixed z so please check pitch-controllability by comparing with normal synthesis.\n Thank you for trying out our PITS demo!"
 94 |         article = "Github:https://github.com/anonymous-pits/pits \n Our current preprint contains several errors. Please wait for next update."
 95 |         examples = [["This is a demo page of the PITS."],["I love hugging face."]]
 96 |         return gr.Interface(
 97 |             fn=self.inference,
 98 |             inputs=self.inputs,
 99 |             outputs=self.outputs,
100 |             title=title,
101 |             description=description,
102 |             article=article,
103 |             cache_examples=False,
104 |             examples=examples,
105 |         )
106 | 
107 |     def launch(self):
108 |         return self.interface.launch(share=False)
109 | 
110 | 
111 | def parsearg():
112 |     parser = argparse.ArgumentParser()
113 |     parser.add_argument('-c',
114 |                         '--config',
115 |                         type=str,
116 |                         default="./configs/config_en.yaml",
117 |                         help='Path to configuration file')
118 |     parser.add_argument('-m',
119 |                         '--model',
120 |                         type=str,
121 |                         default='PITS',
122 |                         help='Model name')
123 |     parser.add_argument('-r',
124 |                         '--checkpoint_path',
125 |                         type=str,
126 |                         default='./logs/pits_vctk_AD_3000.pth',
127 |                         help='Path to checkpoint for resume')
128 |     parser.add_argument('-f',
129 |                         '--force_resume',
130 |                         type=str,
131 |                         help='Path to checkpoint for force resume')
132 |     parser.add_argument('-d',
133 |                         '--dir',
134 |                         type=str,
135 |                         default='/DATA/audio/pits_samples',
136 |                         help='root dir')
137 |     args = parser.parse_args()
138 |     return args
139 | 
140 | if __name__ == "__main__":
141 |     args = parsearg()
142 |     app = GradioApp(args)
143 |     app.launch()
144 | 


--------------------------------------------------------------------------------
/asset/Yingram.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tonnetonne814/PITS-44100-Ja/8283da256f0fd43394ccb6f45497f59021e29e41/asset/Yingram.png


--------------------------------------------------------------------------------
/asset/overall.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tonnetonne814/PITS-44100-Ja/8283da256f0fd43394ccb6f45497f59021e29e41/asset/overall.png


--------------------------------------------------------------------------------
/attentions.py:
--------------------------------------------------------------------------------
  1 | # from https://github.com/jaywalnut310/vits
  2 | import math
  3 | import torch
  4 | from torch import nn
  5 | from torch.nn import functional as F
  6 | 
  7 | import commons
  8 | from modules import LayerNorm
  9 | 
 10 | 
 11 | class Encoder(nn.Module):
 12 |     def __init__(
 13 |         self, 
 14 |         hidden_channels, 
 15 |         filter_channels, 
 16 |         n_heads, 
 17 |         n_layers, 
 18 |         kernel_size=1, 
 19 |         p_dropout=0., 
 20 |         window_size=4, 
 21 |         **kwargs
 22 |     ):
 23 |         super().__init__()
 24 |         self.hidden_channels = hidden_channels
 25 |         self.filter_channels = filter_channels
 26 |         self.n_heads = n_heads
 27 |         self.n_layers = n_layers
 28 |         self.kernel_size = kernel_size
 29 |         self.p_dropout = p_dropout
 30 |         self.window_size = window_size
 31 | 
 32 |         self.drop = nn.Dropout(p_dropout)
 33 |         self.attn_layers = nn.ModuleList()
 34 |         self.norm_layers_1 = nn.ModuleList()
 35 |         self.ffn_layers = nn.ModuleList()
 36 |         self.norm_layers_2 = nn.ModuleList()
 37 |         for i in range(self.n_layers):
 38 |             self.attn_layers.append(
 39 |                 MultiHeadAttention(
 40 |                     hidden_channels, 
 41 |                     hidden_channels, 
 42 |                     n_heads, 
 43 |                     p_dropout=p_dropout, 
 44 |                     window_size=window_size
 45 |                 )
 46 |             )
 47 |             self.norm_layers_1.append(LayerNorm(hidden_channels))
 48 |             self.ffn_layers.append(
 49 |                 FFN(
 50 |                     hidden_channels, 
 51 |                     hidden_channels, 
 52 |                     filter_channels, 
 53 |                     kernel_size, 
 54 |                     p_dropout=p_dropout
 55 |                 )
 56 |             )
 57 |             self.norm_layers_2.append(LayerNorm(hidden_channels))
 58 | 
 59 |     def forward(self, x, x_mask):
 60 |         attn_mask = x_mask.unsqueeze(2) * x_mask.unsqueeze(-1)
 61 |         x = x * x_mask
 62 |         for i in range(self.n_layers):
 63 |             y = self.attn_layers[i](x, x, attn_mask)
 64 |             y = self.drop(y)
 65 |             x = self.norm_layers_1[i](x + y)
 66 | 
 67 |             y = self.ffn_layers[i](x, x_mask)
 68 |             y = self.drop(y)
 69 |             x = self.norm_layers_2[i](x + y)
 70 |         x = x * x_mask
 71 |         return x
 72 | 
 73 | 
 74 | class Decoder(nn.Module):
 75 |     def __init__(
 76 |         self, 
 77 |         hidden_channels, 
 78 |         filter_channels, 
 79 |         n_heads, 
 80 |         n_layers, 
 81 |         kernel_size=1, 
 82 |         p_dropout=0., 
 83 |         proximal_bias=False, 
 84 |         proximal_init=True, 
 85 |         **kwargs
 86 |     ):
 87 |         super().__init__()
 88 |         self.hidden_channels = hidden_channels
 89 |         self.filter_channels = filter_channels
 90 |         self.n_heads = n_heads
 91 |         self.n_layers = n_layers
 92 |         self.kernel_size = kernel_size
 93 |         self.p_dropout = p_dropout
 94 |         self.proximal_bias = proximal_bias
 95 |         self.proximal_init = proximal_init
 96 | 
 97 |         self.drop = nn.Dropout(p_dropout)
 98 |         self.self_attn_layers = nn.ModuleList()
 99 |         self.norm_layers_0 = nn.ModuleList()
100 |         self.encdec_attn_layers = nn.ModuleList()
101 |         self.norm_layers_1 = nn.ModuleList()
102 |         self.ffn_layers = nn.ModuleList()
103 |         self.norm_layers_2 = nn.ModuleList()
104 |         for i in range(self.n_layers):
105 |             self.self_attn_layers.append(
106 |                 MultiHeadAttention(
107 |                     hidden_channels, 
108 |                     hidden_channels, 
109 |                     n_heads,
110 |                     p_dropout=p_dropout, 
111 |                     proximal_bias=proximal_bias, 
112 |                     proximal_init=proximal_init
113 |                 )
114 |             )
115 |             self.norm_layers_0.append(LayerNorm(hidden_channels))
116 |             self.encdec_attn_layers.append(
117 |                 MultiHeadAttention(
118 |                     hidden_channels, 
119 |                     hidden_channels, 
120 |                     n_heads, 
121 |                     p_dropout=p_dropout
122 |                 )
123 |             )
124 |             self.norm_layers_1.append(LayerNorm(hidden_channels))
125 |             self.ffn_layers.append(
126 |                 FFN(
127 |                     hidden_channels, 
128 |                     hidden_channels,
129 |                     filter_channels, 
130 |                     kernel_size, 
131 |                     p_dropout=p_dropout, 
132 |                     causal=True
133 |                 )
134 |             )
135 |             self.norm_layers_2.append(LayerNorm(hidden_channels))
136 | 
137 |     def forward(self, x, x_mask, h, h_mask):
138 |         """
139 |         x: decoder input
140 |         h: encoder output
141 |         """
142 |         self_attn_mask = commons.subsequent_mask(
143 |             x_mask.size(2)
144 |         ).to(device=x.device, dtype=x.dtype)
145 |         encdec_attn_mask = h_mask.unsqueeze(2) * x_mask.unsqueeze(-1)
146 |         x = x * x_mask
147 |         for i in range(self.n_layers):
148 |             y = self.self_attn_layers[i](x, x, self_attn_mask)
149 |             y = self.drop(y)
150 |             x = self.norm_layers_0[i](x + y)
151 | 
152 |             y = self.encdec_attn_layers[i](x, h, encdec_attn_mask)
153 |             y = self.drop(y)
154 |             x = self.norm_layers_1[i](x + y)
155 | 
156 |             y = self.ffn_layers[i](x, x_mask)
157 |             y = self.drop(y)
158 |             x = self.norm_layers_2[i](x + y)
159 |         x = x * x_mask
160 |         return x
161 | 
162 | 
163 | class MultiHeadAttention(nn.Module):
164 |     def __init__(
165 |         self, 
166 |         channels, 
167 |         out_channels, 
168 |         n_heads, 
169 |         p_dropout=0., 
170 |         window_size=None, 
171 |         heads_share=True, 
172 |         block_length=None, 
173 |         proximal_bias=False, 
174 |         proximal_init=False
175 |     ):
176 |         super().__init__()
177 |         assert channels % n_heads == 0
178 | 
179 |         self.channels = channels
180 |         self.out_channels = out_channels
181 |         self.n_heads = n_heads
182 |         self.p_dropout = p_dropout
183 |         self.window_size = window_size
184 |         self.heads_share = heads_share
185 |         self.block_length = block_length
186 |         self.proximal_bias = proximal_bias
187 |         self.proximal_init = proximal_init
188 |         self.attn = None
189 | 
190 |         self.k_channels = channels // n_heads
191 |         self.conv_q = nn.Conv1d(channels, channels, 1)
192 |         self.conv_k = nn.Conv1d(channels, channels, 1)
193 |         self.conv_v = nn.Conv1d(channels, channels, 1)
194 |         self.conv_o = nn.Conv1d(channels, out_channels, 1)
195 |         self.drop = nn.Dropout(p_dropout)
196 | 
197 |         if window_size is not None:
198 |             n_heads_rel = 1 if heads_share else n_heads
199 |             rel_stddev = self.k_channels**-0.5
200 |             self.emb_rel_k = nn.Parameter(torch.randn(
201 |                 n_heads_rel, window_size * 2 + 1, self.k_channels) * rel_stddev)
202 |             self.emb_rel_v = nn.Parameter(torch.randn(
203 |                 n_heads_rel, window_size * 2 + 1, self.k_channels) * rel_stddev)
204 | 
205 |         nn.init.xavier_uniform_(self.conv_q.weight)
206 |         nn.init.xavier_uniform_(self.conv_k.weight)
207 |         nn.init.xavier_uniform_(self.conv_v.weight)
208 |         if proximal_init:
209 |             with torch.no_grad():
210 |                 self.conv_k.weight.copy_(self.conv_q.weight)
211 |                 self.conv_k.bias.copy_(self.conv_q.bias)
212 | 
213 |     def forward(self, x, c, attn_mask=None):
214 |         q = self.conv_q(x)
215 |         k = self.conv_k(c)
216 |         v = self.conv_v(c)
217 | 
218 |         x, self.attn = self.attention(q, k, v, mask=attn_mask)
219 | 
220 |         x = self.conv_o(x)
221 |         return x
222 | 
223 |     def attention(self, query, key, value, mask=None):
224 |         # reshape [b, d, t] -> [b, n_h, t, d_k]
225 |         b, d, t_s, t_t = (*key.size(), query.size(2))
226 |         #query = query.view(
227 |         #    b, 
228 |         #    self.n_heads, 
229 |         #    self.k_channels,
230 |         #    t_t
231 |         #).transpose(2, 3) #[b,h,t_t,c], d=h*c
232 |         #key = key.view(
233 |         #    b, 
234 |         #    self.n_heads, 
235 |         #    self.k_channels, 
236 |         #    t_s
237 |         #).transpose(2, 3) #[b,h,t_s,c]
238 |         #value = value.view(
239 |         #    b, 
240 |         #    self.n_heads, 
241 |         #    self.k_channels,
242 |         #    t_s
243 |         #).transpose(2, 3) #[b,h,t_s,c]
244 |         #scores = torch.matmul(
245 |         #    query / math.sqrt(self.k_channels), key.transpose(-2, -1)
246 |         #) #[b,h,t_t,t_s]
247 |         query = query.view(
248 |             b, 
249 |             self.n_heads, 
250 |             self.k_channels,
251 |             t_t
252 |         ) #[b,h,c,t_t]
253 |         key = key.view(
254 |             b, 
255 |             self.n_heads, 
256 |             self.k_channels, 
257 |             t_s
258 |         ) #[b,h,c,t_s]
259 |         value = value.view(
260 |             b, 
261 |             self.n_heads, 
262 |             self.k_channels,
263 |             t_s
264 |         ) #[b,h,c,t_s]
265 |         scores = torch.einsum('bhdt,bhds -> bhts', query / math.sqrt(self.k_channels), key) #[b,h,t_t,t_s]
266 |         #if self.window_size is not None:
267 |         #    assert t_s == t_t, "Relative attention is only available for self-attention."
268 |         #    key_relative_embeddings = self._get_relative_embeddings(
269 |         #        self.emb_rel_k, t_s
270 |         #    )
271 |         #    rel_logits = self._matmul_with_relative_keys(
272 |         #        query / math.sqrt(self.k_channels), key_relative_embeddings
273 |         #    ) #[b,h,t_t,d],[h or 1,e,d] ->[b,h,t_t,e]
274 |         #    scores_local = self._relative_position_to_absolute_position(rel_logits)
275 |         #    scores = scores + scores_local
276 |         #if self.proximal_bias:
277 |         #    assert t_s == t_t, "Proximal bias is only available for self-attention."
278 |         #    scores = scores + \
279 |         #        self._attention_bias_proximal(t_s).to(device=scores.device, dtype=scores.dtype)
280 |         #if mask is not None:
281 |         #    scores = scores.masked_fill(mask == 0, -1e4)
282 |         #    if self.block_length is not None:
283 |         #        assert t_s == t_t, "Local attention is only available for self-attention."
284 |         #        block_mask = torch.ones_like(scores).triu(-self.block_length).tril(self.block_length)
285 |         #        scores = scores.masked_fill(block_mask == 0, -1e4)
286 |         #p_attn = F.softmax(scores, dim=-1)  # [b, h, t_t, t_s]
287 |         #p_attn = self.drop(p_attn)
288 |         #output = torch.matmul(p_attn, value) # [b,h,t_t,t_s],[b,h,t_s,c] -> [b,h,t_t,c]
289 |         #if self.window_size is not None:
290 |         #    relative_weights = self._absolute_position_to_relative_position(p_attn) #[b, h, t_t, 2*t_t-1]
291 |         #    value_relative_embeddings = self._get_relative_embeddings(self.emb_rel_v, t_s) #[h or 1, 2*t_t-1, c]
292 |         #    output = output + \
293 |         #        self._matmul_with_relative_values(
294 |         #            relative_weights, value_relative_embeddings) # [b, h, t_t, 2*t_t-1],[h or 1, 2*t_t-1, c] -> [b, h, t_t, c]
295 |         #output = output.transpose(2, 3).contiguous().view(b, d, t_t)  # [b, n_h, t_t, c] -> [b,h,c,t_t] -> [b, d, t_t]
296 |         if self.window_size is not None:
297 |             assert t_s == t_t, "Relative attention is only available for self-attention."
298 |             key_relative_embeddings = self._get_relative_embeddings(
299 |                 self.emb_rel_k, t_s
300 |             )
301 |             rel_logits = torch.einsum('bhdt,hed->bhte',
302 |                 query / math.sqrt(self.k_channels), key_relative_embeddings
303 |             ) #[b,h,c,t_t],[h or 1,e,c] ->[b,h,t_t,e]
304 |             scores_local = self._relative_position_to_absolute_position(rel_logits)
305 |             scores = scores + scores_local
306 |         if self.proximal_bias:
307 |             assert t_s == t_t, "Proximal bias is only available for self-attention."
308 |             scores = scores + \
309 |                 self._attention_bias_proximal(t_s).to(device=scores.device, dtype=scores.dtype)
310 |         if mask is not None:
311 |             scores = scores.masked_fill(mask == 0, -1e4)
312 |             if self.block_length is not None:
313 |                 assert t_s == t_t, "Local attention is only available for self-attention."
314 |                 block_mask = torch.ones_like(scores).triu(-self.block_length).tril(self.block_length)
315 |                 scores = scores.masked_fill(block_mask == 0, -1e4)
316 |         p_attn = F.softmax(scores, dim=-1)  # [b, h, t_t, t_s]
317 |         p_attn = self.drop(p_attn)
318 |         output = torch.einsum('bhcs,bhts->bhct', value , p_attn) # [b,h,c,t_s],[b,h,t_t,t_s] -> [b,h,c,t_t]
319 |         if self.window_size is not None:
320 |             relative_weights = self._absolute_position_to_relative_position(p_attn) #[b, h, t_t, 2*t_t-1]
321 |             value_relative_embeddings = self._get_relative_embeddings(self.emb_rel_v, t_s) #[h or 1, 2*t_t-1, c]
322 |             output = output + \
323 |                 torch.einsum('bhte,hec->bhct',
324 |                     relative_weights, value_relative_embeddings) # [b, h, t_t, 2*t_t-1],[h or 1, 2*t_t-1, c] -> [b, h, c, t_t]
325 |         output = output.view(b, d, t_t)  # [b, h, c, t_t] -> [b, d, t_t]
326 |         return output, p_attn
327 | 
328 |     def _matmul_with_relative_values(self, x, y):
329 |         """
330 |         x: [b, h, l, m]
331 |         y: [h or 1, m, d]
332 |         ret: [b, h, l, d]
333 |         """
334 |         ret = torch.matmul(x, y.unsqueeze(0))
335 |         return ret
336 | 
337 |     def _matmul_with_relative_keys(self, x, y):
338 |         """
339 |         x: [b, h, l, d]
340 |         y: [h or 1, m, d]
341 |         ret: [b, h, l, m]
342 |         """
343 |         #ret = torch.matmul(x, y.unsqueeze(0).transpose(-2, -1))
344 |         ret = torch.einsum('bhld,hmd -> bhlm', x, y)
345 |         return ret
346 | 
347 |     def _get_relative_embeddings(self, relative_embeddings, length):
348 |         max_relative_position = 2 * self.window_size + 1
349 |         # Pad first before slice to avoid using cond ops.
350 |         pad_length = max(length - (self.window_size + 1), 0)
351 |         slice_start_position = max((self.window_size + 1) - length, 0)
352 |         slice_end_position = slice_start_position + 2 * length - 1
353 |         if pad_length > 0:
354 |             padded_relative_embeddings = F.pad(
355 |                 relative_embeddings,
356 |                 commons.convert_pad_shape([[0, 0], [pad_length, pad_length], [0, 0]]))
357 |         else:
358 |             padded_relative_embeddings = relative_embeddings
359 |         used_relative_embeddings = padded_relative_embeddings[
360 |             :, slice_start_position:slice_end_position
361 |         ]
362 |         return used_relative_embeddings
363 | 
364 |     def _relative_position_to_absolute_position(self, x):
365 |         """
366 |         x: [b, h, l, 2*l-1]
367 |         ret: [b, h, l, l]
368 |         """
369 |         batch, heads, length, _ = x.size()
370 |         # Concat columns of pad to shift from relative to absolute indexing.
371 |         x = F.pad(x, commons.convert_pad_shape(
372 |             [[0, 0], [0, 0], [0, 0], [0, 1]]
373 |         ))
374 | 
375 |         # Concat extra elements so to add up to shape (len+1, 2*len-1).
376 |         x_flat = x.view([batch, heads, length * 2 * length])
377 |         x_flat = F.pad(x_flat, commons.convert_pad_shape(
378 |             [[0, 0], [0, 0], [0, length-1]]
379 |         ))
380 | 
381 |         # Reshape and slice out the padded elements.
382 |         x_final = x_flat.view([batch, heads, length+1, 2*length-1])[
383 |             :, :, :length, length-1:
384 |         ]
385 |         return x_final
386 | 
387 |     def _absolute_position_to_relative_position(self, x):
388 |         """
389 |         x: [b, h, l, l]
390 |         ret: [b, h, l, 2*l-1]
391 |         """
392 |         batch, heads, length, _ = x.size()
393 |         # padd along column
394 |         x = F.pad(x, commons.convert_pad_shape(
395 |             [[0, 0], [0, 0], [0, 0], [0, length-1]]
396 |         ))
397 |         x_flat = x.view([batch, heads, length**2 + length*(length - 1)])
398 |         # add 0's in the beginning that will skew the elements after reshape
399 |         x_flat = F.pad(x_flat, commons.convert_pad_shape(
400 |             [[0, 0], [0, 0], [length, 0]]
401 |         ))
402 |         x_final = x_flat.view([batch, heads, length, 2*length])[:, :, :, 1:]
403 |         return x_final
404 | 
405 |     def _attention_bias_proximal(self, length):
406 |         """Bias for self-attention to encourage attention to close positions.
407 |         Args:
408 |           length: an integer scalar.
409 |         Returns:
410 |           a Tensor with shape [1, 1, length, length]
411 |         """
412 |         r = torch.arange(length, dtype=torch.float32)
413 |         diff = torch.unsqueeze(r, 0) - torch.unsqueeze(r, 1)
414 |         return torch.unsqueeze(torch.unsqueeze(-torch.log1p(torch.abs(diff)), 0), 0)
415 | 
416 | 
417 | class FFN(nn.Module):
418 |     def __init__(
419 |         self, 
420 |         in_channels, 
421 |         out_channels, 
422 |         filter_channels, 
423 |         kernel_size, 
424 |         p_dropout=0., 
425 |         activation=None, 
426 |         causal=False
427 |     ):
428 |         super().__init__()
429 |         self.in_channels = in_channels
430 |         self.out_channels = out_channels
431 |         self.filter_channels = filter_channels
432 |         self.kernel_size = kernel_size
433 |         self.p_dropout = p_dropout
434 |         self.activation = activation
435 |         self.causal = causal
436 | 
437 |         if causal:
438 |             self.padding = self._causal_padding
439 |         else:
440 |             self.padding = self._same_padding
441 | 
442 |         self.conv_1 = nn.Conv1d(in_channels, filter_channels, kernel_size)
443 |         self.conv_2 = nn.Conv1d(filter_channels, out_channels, kernel_size)
444 |         self.drop = nn.Dropout(p_dropout)
445 | 
446 |     def forward(self, x, x_mask):
447 |         x = self.conv_1(self.padding(x * x_mask))
448 |         if self.activation == "gelu":
449 |             x = x * torch.sigmoid(1.702 * x)
450 |         else:
451 |             x = torch.relu(x)
452 |         x = self.drop(x)
453 |         x = self.conv_2(self.padding(x * x_mask))
454 |         return x * x_mask
455 | 
456 |     def _causal_padding(self, x):
457 |         if self.kernel_size == 1:
458 |             return x
459 |         pad_l = self.kernel_size - 1
460 |         pad_r = 0
461 |         padding = [[0, 0], [0, 0], [pad_l, pad_r]]
462 |         x = F.pad(x, commons.convert_pad_shape(padding))
463 |         return x
464 | 
465 |     def _same_padding(self, x):
466 |         if self.kernel_size == 1:
467 |             return x
468 |         pad_l = (self.kernel_size - 1) // 2
469 |         pad_r = self.kernel_size // 2
470 |         padding = [[0, 0], [0, 0], [pad_l, pad_r]]
471 |         x = F.pad(x, commons.convert_pad_shape(padding))
472 |         return x
473 | 


--------------------------------------------------------------------------------
/commons.py:
--------------------------------------------------------------------------------
  1 | # from https://github.com/jaywalnut310/vits
  2 | import math
  3 | import torch
  4 | from torch.nn import functional as F
  5 | 
  6 | 
  7 | def init_weights(m, mean=0.0, std=0.01):
  8 |     classname = m.__class__.__name__
  9 |     if classname.find("Conv") != -1:
 10 |         m.weight.data.normal_(mean, std)
 11 | 
 12 | 
 13 | def get_padding(kernel_size, dilation=1):
 14 |     return int((kernel_size * dilation - dilation) / 2)
 15 | 
 16 | 
 17 | def convert_pad_shape(pad_shape):
 18 |     l = pad_shape[::-1]
 19 |     pad_shape = [item for sublist in l for item in sublist]
 20 |     return pad_shape
 21 | 
 22 | 
 23 | def intersperse(lst, item):
 24 |     result = [item] * (len(lst) * 2 + 1)
 25 |     result[1::2] = lst
 26 |     return result
 27 | 
 28 | 
 29 | def kl_divergence(m_p, logs_p, m_q, logs_q):
 30 |     """KL(P||Q)"""
 31 |     kl = (logs_q - logs_p) - 0.5
 32 |     kl += 0.5 * (torch.exp(2. * logs_p) + ((m_p - m_q)**2)) * torch.exp(-2. * logs_q)
 33 |     return kl
 34 | 
 35 | 
 36 | def rand_gumbel(shape):
 37 |     """Sample from the Gumbel distribution, protect from overflows."""
 38 |     uniform_samples = torch.rand(shape) * 0.99998 + 0.00001
 39 |     return -torch.log(-torch.log(uniform_samples))
 40 | 
 41 | 
 42 | def rand_gumbel_like(x):
 43 |     g = rand_gumbel(x.size()).to(dtype=x.dtype, device=x.device)
 44 |     return g
 45 | 
 46 | 
 47 | def slice_segments(x, ids_str, segment_size=4):
 48 |     ret = torch.zeros_like(x[:, :, :segment_size])
 49 |     for i in range(x.size(0)):
 50 |         idx_str = ids_str[i]
 51 |         idx_end = idx_str + segment_size
 52 |         ret[i] = x[i, :, idx_str:idx_end]
 53 |     return ret
 54 | 
 55 | 
 56 | def rand_slice_segments(x, x_lengths=None, segment_size=4):
 57 |     b, d, t = x.size()
 58 |     if x_lengths is None:
 59 |         x_lengths = t
 60 |     ids_str_max = x_lengths - segment_size + 1
 61 |     ids_str = (torch.rand([b]).to(device=x.device)
 62 |                * ids_str_max).to(dtype=torch.long)
 63 |     ids_str = torch.max(torch.zeros(ids_str.size()).to(ids_str.device), ids_str).to(dtype=torch.long)
 64 |     ret = slice_segments(x, ids_str, segment_size)
 65 |     return ret, ids_str
 66 | 
 67 | def rand_slice_segments_for_cat(x, x_lengths=None, segment_size=4):
 68 |     b, d, t = x.size()
 69 |     if x_lengths is None:
 70 |         x_lengths = t
 71 |     ids_str_max = x_lengths - segment_size + 1
 72 |     ids_str = torch.rand([b//2]).to(device=x.device)
 73 |     ids_str = (torch.cat([ids_str,ids_str], dim=0)
 74 |                * ids_str_max).to(dtype=torch.long)
 75 |     ids_str = torch.max(torch.zeros(ids_str.size()).to(ids_str.device), ids_str).to(dtype=torch.long)
 76 |     ret = slice_segments(x, ids_str, segment_size)
 77 |     return ret, ids_str
 78 | 
 79 | 
 80 | 
 81 | 
 82 | def get_timing_signal_1d(
 83 |         length, channels, min_timescale=1.0, max_timescale=1.0e4):
 84 |     position = torch.arange(length, dtype=torch.float)
 85 |     num_timescales = channels // 2
 86 |     log_timescale_increment = (
 87 |         math.log(float(max_timescale) / float(min_timescale)) / (num_timescales - 1)
 88 |     )
 89 |     inv_timescales = min_timescale * torch.exp(
 90 |         torch.arange(num_timescales, dtype=torch.float) * -log_timescale_increment
 91 |     )
 92 |     scaled_time = position.unsqueeze(0) * inv_timescales.unsqueeze(1)
 93 |     signal = torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], 0)
 94 |     signal = F.pad(signal, [0, 0, 0, channels % 2])
 95 |     signal = signal.view(1, channels, length)
 96 |     return signal
 97 | 
 98 | 
 99 | def add_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4):
100 |     b, channels, length = x.size()
101 |     signal = get_timing_signal_1d(
102 |         length, channels, min_timescale, max_timescale
103 |     )
104 |     return x + signal.to(dtype=x.dtype, device=x.device)
105 | 
106 | 
107 | def cat_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4, axis=1):
108 |     b, channels, length = x.size()
109 |     signal = get_timing_signal_1d(
110 |         length, channels, min_timescale, max_timescale
111 |     )
112 |     return torch.cat([x, signal.to(dtype=x.dtype, device=x.device)], axis)
113 | 
114 | 
115 | def subsequent_mask(length):
116 |     mask = torch.tril(torch.ones(length, length)).unsqueeze(0).unsqueeze(0)
117 |     return mask
118 | 
119 | 
120 | @torch.jit.script
121 | def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels):
122 |     n_channels_int = n_channels[0]
123 |     in_act = input_a + input_b
124 |     t_act = torch.tanh(in_act[:, :n_channels_int, :])
125 |     s_act = torch.sigmoid(in_act[:, n_channels_int:, :])
126 |     acts = t_act * s_act
127 |     return acts
128 | 
129 | 
130 | def convert_pad_shape(pad_shape):
131 |     l = pad_shape[::-1]
132 |     pad_shape = [item for sublist in l for item in sublist]
133 |     return pad_shape
134 | 
135 | 
136 | def shift_1d(x):
137 |     x = F.pad(x, convert_pad_shape([[0, 0], [0, 0], [1, 0]]))[:, :, :-1]
138 |     return x
139 | 
140 | 
141 | def sequence_mask(length, max_length=None):
142 |     if max_length is None:
143 |         max_length = length.max()
144 |     x = torch.arange(max_length, dtype=length.dtype, device=length.device)
145 |     return x.unsqueeze(0) < length.unsqueeze(1)
146 | 
147 | 
148 | def generate_path(duration, mask):
149 |     """
150 |     duration: [b, 1, t_x]
151 |     mask: [b, 1, t_y, t_x]
152 |     """
153 |     device = duration.device
154 | 
155 |     b, _, t_y, t_x = mask.shape
156 |     cum_duration = torch.cumsum(duration, -1)
157 | 
158 |     cum_duration_flat = cum_duration.view(b * t_x)
159 |     path = sequence_mask(cum_duration_flat, t_y).to(mask.dtype)
160 |     path = path.view(b, t_x, t_y)
161 |     path = path - F.pad(path, convert_pad_shape([[0, 0], [1, 0], [0, 0]]))[:, :-1]
162 |     path = path.unsqueeze(1).transpose(2, 3) * mask
163 |     return path
164 | 
165 | 
166 | def clip_grad_value_(parameters, clip_value, norm_type=2):
167 |     if isinstance(parameters, torch.Tensor):
168 |         parameters = [parameters]
169 |     parameters = list(filter(lambda p: p.grad is not None, parameters))
170 |     norm_type = float(norm_type)
171 |     if clip_value is not None:
172 |         clip_value = float(clip_value)
173 | 
174 |     total_norm = 0
175 |     for p in parameters:
176 |         param_norm = p.grad.data.norm(norm_type)
177 |         total_norm += param_norm.item() ** norm_type
178 |         if clip_value is not None:
179 |             p.grad.data.clamp_(min=-clip_value, max=clip_value)
180 |     total_norm = total_norm ** (1. / norm_type)
181 |     return total_norm
182 | 


--------------------------------------------------------------------------------
/configs/config_en.yaml:
--------------------------------------------------------------------------------
 1 | train:
 2 |   log_interval: 200   # step unit
 3 |   eval_interval: 400  # step unit
 4 |   save_interval: 50  # epoch unit: 50 for baseline / 500 for fine-tuning
 5 |   seed: 1234
 6 |   epochs: 7000
 7 |   learning_rate: 2e-4 
 8 |   betas: [0.8, 0.99]
 9 |   eps: 1e-9
10 |   batch_size: 48
11 |   fp16_run: True  #False
12 |   lr_decay: 0.999875
13 |   segment_size: 8192
14 |   c_mel: 45
15 |   c_kl: 1.0
16 |   c_vq: 1.
17 |   c_commit: 0.2
18 |   c_yin: 45.
19 |   log_path: "/pits/logs"
20 |   n_sample: 3
21 |   alpha: 200
22 | 
23 | data:
24 |   data_path: "/DATA/audio/VCTK-0.92"
25 |   training_files: "filelists/vctk_train_g2p.txt"
26 |   validation_files: "filelists/vctk_val_g2p.txt"
27 |   languages: "en_US" 
28 |   text_cleaners: ["english_cleaners"]
29 |   sampling_rate: 22050
30 |   filter_length: 1024
31 |   hop_length: 256
32 |   win_length: 1024
33 |   n_mel_channels: 80
34 |   mel_fmin: 0.0
35 |   mel_fmax: null
36 |   add_blank: True
37 |   speakers: ["p225", "p226", "p227", "p228", "p229", "p230", "p231", "p232", "p233", "p234", "p236", "p237", "p238", "p239", "p240", "p241", "p243", "p244", "p245", "p246", "p247", "p248", "p249", "p250", "p251", "p252", "p253", "p254", "p255", "p256", "p257", "p258", "p259", "p260", "p261", "p262", "p263", "p264", "p265", "p266", "p267", "p268", "p269", "p270", "p271", "p272", "p273", "p274", "p275", "p276", "p277", "p278", "p279", "p281", "p282", "p283", "p284", "p285", "p286", "p287", "p288", "p292", "p293", "p294", "p295", "p297", "p298", "p299", "p300", "p301", "p302", "p303", "p304", "p305", "p306", "p307", "p308", "p310", "p311", "p312", "p313", "p314", "p316", "p317", "p318", "p323", "p326", "p329", "p330", "p333", "p334", "p335", "p336", "p339", "p340", "p341", "p343", "p345", "p347", "p351", "p360", "p361", "p362", "p363", "p364", "p374", "p376", "s5"]
38 |   persistent_workers: True
39 |   midi_start: -5
40 |   midi_end: 75
41 |   midis: 80
42 |   ying_window: 2048
43 |   ying_hop: 256
44 |   tau_max: 2048
45 |   octave_range: 24
46 |   
47 | model:
48 |   inter_channels: 192
49 |   hidden_channels: 192
50 |   filter_channels: 768
51 |   n_heads: 2
52 |   n_layers: 6
53 |   kernel_size: 3
54 |   p_dropout: 0.1
55 |   resblock: "1"
56 |   resblock_kernel_sizes: [3,7,11]
57 |   resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
58 |   upsample_rates: [8,8,2,2]
59 |   upsample_initial_channel: 512
60 |   upsample_kernel_sizes: [16,16,4,4]
61 |   n_layers_q: 3
62 |   use_spectral_norm: False
63 |   gin_channels: 256
64 |   codebook_size: 320
65 |   yin_channels: 80
66 |   yin_start: 15 # scope start bin in nansy = 1.5/8 
67 |   yin_scope: 50 # scope ratio in nansy = 5/8 
68 |   yin_shift_range: 15 # same as default start index of yingram
69 | 


--------------------------------------------------------------------------------
/configs/config_ja_22050.yaml:
--------------------------------------------------------------------------------
 1 | train:
 2 |   log_interval: 200   # step unit
 3 |   eval_interval: 400  # step unit
 4 |   save_interval: 50  # epoch unit: 50 for baseline / 500 for fine-tuning
 5 |   seed: 1234
 6 |   epochs: 7000
 7 |   learning_rate: 2e-4 
 8 |   betas: [0.8, 0.99]
 9 |   eps: 1e-9
10 |   batch_size: 48
11 |   fp16_run: True  #False
12 |   lr_decay: 0.999875
13 |   segment_size: 8192
14 |   c_mel: 45
15 |   c_kl: 1.0
16 |   c_vq: 1.
17 |   c_commit: 0.2
18 |   c_yin: 45.
19 |   log_path: "/logs/"
20 |   n_sample: 3
21 |   alpha: 200
22 | 
23 | data:
24 |   data_path: "./dataset/jvs_ver1/"
25 |   training_files: "./filelists/jvs_train_22050.txt"
26 |   validation_files: "./filelists/jvs_val_22050.txt"
27 |   languages: "pyopenjtalk_prosody" 
28 |   text_cleaners: []
29 |   sampling_rate: 22050
30 |   filter_length: 1024
31 |   hop_length: 256
32 |   win_length: 1024
33 |   n_mel_channels: 80
34 |   mel_fmin: 0.0
35 |   mel_fmax: null
36 |   add_blank: True
37 |   speakers: ['jvs001', 'jvs002', 'jvs003', 'jvs004', 'jvs005', 'jvs006', 'jvs007', 'jvs008', 'jvs009', 'jvs010', 'jvs011', 'jvs012', 'jvs013', 'jvs014', 'jvs015', 'jvs016', 'jvs017', 'jvs018', 'jvs019', 'jvs020', 'jvs021', 'jvs022', 'jvs023', 'jvs024', 'jvs025', 'jvs026', 'jvs027', 'jvs028', 'jvs029', 'jvs030', 'jvs031', 'jvs032', 'jvs033', 'jvs034', 'jvs035', 'jvs036', 'jvs037', 'jvs038', 'jvs039', 'jvs040', 'jvs041', 'jvs042', 'jvs043', 'jvs044', 'jvs045', 'jvs046', 'jvs047', 'jvs048', 'jvs049', 'jvs050', 'jvs051', 'jvs052', 'jvs053', 'jvs054', 'jvs055', 'jvs056', 'jvs057', 'jvs058', 'jvs059', 'jvs060', 'jvs061', 'jvs062', 'jvs063', 'jvs064', 'jvs065', 'jvs066', 'jvs067', 'jvs068', 'jvs069', 'jvs070', 'jvs071', 'jvs072', 'jvs073', 'jvs074', 'jvs075', 'jvs076', 'jvs077', 'jvs078', 'jvs079', 'jvs080', 'jvs081', 'jvs082', 'jvs083', 'jvs084', 'jvs085', 'jvs086', 'jvs087', 'jvs088', 'jvs089', 'jvs090', 'jvs091', 'jvs092', 'jvs093', 'jvs094', 'jvs095', 'jvs096', 'jvs097', 'jvs098', 'jvs099', 'jvs100']
38 |   persistent_workers: True
39 |   midi_start: -5
40 |   midi_end: 75
41 |   midis: 80
42 |   ying_window: 2048
43 |   ying_hop: 256
44 |   tau_max: 2048
45 |   octave_range: 24
46 |   
47 | model:
48 |   inter_channels: 192
49 |   hidden_channels: 192
50 |   filter_channels: 768
51 |   n_heads: 2
52 |   n_layers: 6
53 |   kernel_size: 3
54 |   p_dropout: 0.1
55 |   resblock: "1"
56 |   resblock_kernel_sizes: [3,7,11]
57 |   resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
58 |   upsample_rates: [8,8,2,2]
59 |   upsample_initial_channel: 512
60 |   upsample_kernel_sizes: [16,16,4,4]
61 |   n_layers_q: 3
62 |   use_spectral_norm: False
63 |   gin_channels: 256
64 |   codebook_size: 320
65 |   yin_channels: 80
66 |   yin_start: 15 # scope start bin in nansy = 1.5/8 
67 |   yin_scope: 50 # scope ratio in nansy = 5/8 
68 |   yin_shift_range: 15 # same as default start index of yingram
69 | 


--------------------------------------------------------------------------------
/configs/config_ja_44100.yaml:
--------------------------------------------------------------------------------
 1 | train:
 2 |   log_interval: 200   # step unit
 3 |   eval_interval: 400  # step unit
 4 |   save_interval: 10  # epoch unit: 50 for baseline / 500 for fine-tuning
 5 |   seed: 1234
 6 |   epochs: 7000
 7 |   learning_rate: 2e-4 
 8 |   betas: [0.8, 0.99]
 9 |   eps: 1e-9
10 |   batch_size: 32
11 |   fp16_run: True  #False
12 |   lr_decay: 0.999875
13 |   segment_size: 16384 #8192
14 |   c_mel: 45
15 |   c_kl: 1.0
16 |   c_vq: 1.
17 |   c_commit: 0.2
18 |   c_yin: 45.
19 |   log_path: "/logs/"
20 |   n_sample: 3
21 |   alpha: 200
22 | 
23 | data:
24 |   data_path: "./dataset/jvs_ver1/"
25 |   training_files: "./filelists/jvs_train_44100.txt"
26 |   validation_files: "./filelists/jvs_val_44100.txt"
27 |   languages: "pyopenjtalk_prosody" 
28 |   text_cleaners: []
29 |   sampling_rate: 44100 #22050 
30 |   filter_length: 2048 #1024 
31 |   hop_length: 512 #256 
32 |   win_length: 2048 #1024
33 |   n_mel_channels: 80
34 |   mel_fmin: 0.0
35 |   mel_fmax: null
36 |   add_blank: True
37 |   speakers: ['jvs001', 'jvs002', 'jvs003', 'jvs004', 'jvs005', 'jvs006', 'jvs007', 'jvs008', 'jvs009', 'jvs010', 'jvs011', 'jvs012', 'jvs013', 'jvs014', 'jvs015', 'jvs016', 'jvs017', 'jvs018', 'jvs019', 'jvs020', 'jvs021', 'jvs022', 'jvs023', 'jvs024', 'jvs025', 'jvs026', 'jvs027', 'jvs028', 'jvs029', 'jvs030', 'jvs031', 'jvs032', 'jvs033', 'jvs034', 'jvs035', 'jvs036', 'jvs037', 'jvs038', 'jvs039', 'jvs040', 'jvs041', 'jvs042', 'jvs043', 'jvs044', 'jvs045', 'jvs046', 'jvs047', 'jvs048', 'jvs049', 'jvs050', 'jvs051', 'jvs052', 'jvs053', 'jvs054', 'jvs055', 'jvs056', 'jvs057', 'jvs058', 'jvs059', 'jvs060', 'jvs061', 'jvs062', 'jvs063', 'jvs064', 'jvs065', 'jvs066', 'jvs067', 'jvs068', 'jvs069', 'jvs070', 'jvs071', 'jvs072', 'jvs073', 'jvs074', 'jvs075', 'jvs076', 'jvs077', 'jvs078', 'jvs079', 'jvs080', 'jvs081', 'jvs082', 'jvs083', 'jvs084', 'jvs085', 'jvs086', 'jvs087', 'jvs088', 'jvs089', 'jvs090', 'jvs091', 'jvs092', 'jvs093', 'jvs094', 'jvs095', 'jvs096', 'jvs097', 'jvs098', 'jvs099', 'jvs100']
38 |   persistent_workers: True
39 |   midi_start: -5
40 |   midi_end: 75
41 |   midis: 80
42 |   ying_window: 4096 #2048 =tau_max
43 |   ying_hop: 512 #256   
44 |   tau_max: 4096 #2048  diff maximum period
45 |   octave_range: 24
46 |   
47 | model:
48 |   inter_channels: 192
49 |   hidden_channels: 192
50 |   filter_channels: 768
51 |   n_heads: 2
52 |   n_layers: 6
53 |   kernel_size: 3
54 |   p_dropout: 0.1
55 |   resblock: "1"
56 |   resblock_kernel_sizes: [3,7,11]
57 |   resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
58 |   upsample_rates: [8,8,2,2,2] #[8,8,2,2]
59 |   upsample_initial_channel: 512
60 |   upsample_kernel_sizes: [16,16,4,4,4] #[16,16,4,4]
61 |   n_layers_q: 3
62 |   use_spectral_norm: False
63 |   gin_channels: 256
64 |   codebook_size: 320
65 |   yin_channels: 80
66 |   yin_start: 15 # scope start bin in nansy = 1.5/8 
67 |   yin_scope: 50 # scope ratio in nansy = 5/8 
68 |   yin_shift_range: 15 # same as default start index of yingram
69 | 


--------------------------------------------------------------------------------
/data_utils.py:
--------------------------------------------------------------------------------
  1 | # modified from https://github.com/jaywalnut310/vits
  2 | import os
  3 | import random
  4 | import torch
  5 | import torch.utils.data
  6 | 
  7 | import commons
  8 | from mel_processing import spectrogram_torch
  9 | from utils import load_wav_to_torch, load_filepaths_and_text
 10 | from text import text_to_sequence
 11 | from analysis import Pitch
 12 | """ Modified from Multi speaker version of VITS"""
 13 | 
 14 | 
 15 | class TextAudioSpeakerLoader(torch.utils.data.Dataset):
 16 |     """
 17 |         1) loads audio, speaker_id, text pairs
 18 |         2) normalizes text and converts them to sequences of integers
 19 |         3) computes spectrograms from audio files.
 20 |     """
 21 | 
 22 |     def __init__(self, audiopaths_sid_text, hparams, pt_run=False):
 23 |         self.audiopaths_sid_text = load_filepaths_and_text(audiopaths_sid_text)
 24 |         self.text_cleaners = hparams.text_cleaners
 25 |         self.sampling_rate = hparams.sampling_rate
 26 |         self.filter_length = hparams.filter_length
 27 |         self.hop_length = hparams.hop_length
 28 |         self.win_length = hparams.win_length
 29 | 
 30 |         self.lang = hparams.languages
 31 | 
 32 |         self.add_blank = hparams.add_blank
 33 |         self.min_text_len = getattr(hparams, "min_text_len", 1)
 34 |         self.max_text_len = getattr(hparams, "max_text_len", 190)
 35 | 
 36 |         self.speaker_dict = {
 37 |             speaker: idx
 38 |             for idx, speaker in enumerate(hparams.speakers)
 39 |         }
 40 |         self.data_path = hparams.data_path
 41 | 
 42 |         self.pitch = Pitch(sr=hparams.sampling_rate,
 43 |                            ### modify ##############
 44 |                            #W=hparams.tau_max,
 45 |                            W=hparams.ying_window,           
 46 |                            w_step=hparams.ying_hop,
 47 |                            ######################### 
 48 |                            tau_max=hparams.tau_max,
 49 |                            midi_start=hparams.midi_start,
 50 |                            midi_end=hparams.midi_end,
 51 |                            octave_range=hparams.octave_range)
 52 | 
 53 |         random.seed(1234)
 54 |         random.shuffle(self.audiopaths_sid_text)
 55 |         self._filter()
 56 |         if pt_run:
 57 |             for _audiopaths_sid_text in self.audiopaths_sid_text:
 58 |                 _ = self.get_audio_text_speaker_pair(_audiopaths_sid_text,
 59 |                                                      True)
 60 | 
 61 |     def _filter(self):
 62 |         """
 63 |         Filter text & store spec lengths
 64 |         """
 65 |         # Store spectrogram lengths for Bucketing
 66 |         # wav_length ~= file_size / (wav_channels * Bytes per dim) = file_size / (1 * 2)
 67 |         # spec_length = wav_length // hop_length
 68 | 
 69 |         audiopaths_sid_text_new = []
 70 |         lengths = []
 71 |         for audiopath, text, spk in self.audiopaths_sid_text:
 72 |             audiopath = audiopath.replace("\\", "/")
 73 |             if self.min_text_len <= len(text) and len(
 74 |                     text) <= self.max_text_len:
 75 |                 audiopath = os.path.join(self.data_path, audiopath)
 76 |                 audiopaths_sid_text_new.append([audiopath, text, spk])
 77 |                 lengths.append(
 78 |                     os.path.getsize(audiopath) // (2 * self.hop_length))
 79 |         self.audiopaths_sid_text = audiopaths_sid_text_new
 80 |         self.lengths = lengths
 81 | 
 82 |     def get_audio_text_speaker_pair(self, audiopath_sid_text, pt_run=False):
 83 |         # separate filename, speaker_id and text
 84 |         audiopath, text, spk = audiopath_sid_text[0], audiopath_sid_text[
 85 |             1], audiopath_sid_text[2]
 86 |         text, tone = self.get_text(text)
 87 |         spec, ying, wav = self.get_audio(audiopath, pt_run)
 88 |         sid = self.get_sid(self.speaker_dict[spk])
 89 |         return (text, spec, ying, wav, sid, tone)
 90 | 
 91 |     def get_audio(self, filename, pt_run=False):
 92 |         audio, sampling_rate = load_wav_to_torch(filename)
 93 |         if sampling_rate != self.sampling_rate:
 94 |             raise ValueError("{} {} SR doesn't match target {} SR".format(
 95 |                 sampling_rate, self.sampling_rate))
 96 |         audio_norm = audio.unsqueeze(0)
 97 |         spec_filename = filename.replace(".wav", ".spec.pt")
 98 |         ying_filename = filename.replace(".wav", ".ying.pt")
 99 |         if os.path.exists(spec_filename) and not pt_run:
100 |             spec = torch.load(spec_filename, map_location='cpu')
101 |         else:
102 |             spec = spectrogram_torch(audio_norm,
103 |                                      self.filter_length,
104 |                                      self.sampling_rate,
105 |                                      self.hop_length,
106 |                                      self.win_length,
107 |                                      center=False)
108 |             spec = torch.squeeze(spec, 0)
109 |             torch.save(spec, spec_filename)
110 |         if os.path.exists(ying_filename) and not pt_run:
111 |             ying = torch.load(ying_filename, map_location='cpu')
112 |         else:
113 |             wav = torch.nn.functional.pad(
114 |                 audio_norm.unsqueeze(0),
115 |                 (self.filter_length - self.hop_length,
116 |                  self.filter_length - self.hop_length +
117 |                  (-audio_norm.shape[1]) % self.hop_length + self.hop_length * (audio_norm.shape[1] % self.hop_length == 0)),
118 |                 mode='constant').squeeze(0)
119 |             ying = self.pitch.yingram(wav)[0]
120 |             torch.save(ying, ying_filename)
121 |         return spec, ying, audio_norm
122 | 
123 |     def get_text(self, text):
124 |         text_norm, tone = text_to_sequence(text, self.lang)
125 |         if self.add_blank:
126 |             text_norm = commons.intersperse(text_norm, 0)
127 |             tone = commons.intersperse(tone, 0)
128 |         text_norm = torch.LongTensor(text_norm)
129 |         tone = torch.LongTensor(tone)
130 |         return text_norm, tone
131 | 
132 |     def get_sid(self, sid):
133 |         sid = torch.LongTensor([int(sid)])
134 |         return sid
135 | 
136 |     def __getitem__(self, index):
137 |         return self.get_audio_text_speaker_pair(
138 |             self.audiopaths_sid_text[index])
139 | 
140 |     def __len__(self):
141 |         return len(self.audiopaths_sid_text)
142 | 
143 | 
144 | class TextAudioSpeakerCollate():
145 |     """ Zero-pads model inputs and targets"""
146 | 
147 |     def __init__(self, return_ids=False):
148 |         self.return_ids = return_ids
149 | 
150 |     def __call__(self, batch):
151 |         """Collate's training batch from normalized text, audio and speaker identities
152 |         PARAMS
153 |         ------
154 |         batch: [text_normalized, spec_normalized, wav_normalized, sid]
155 |         """
156 |         # Right zero-pad all one-hot text sequences to max input length
157 |         _, ids_sorted_decreasing = torch.sort(torch.LongTensor(
158 |             [x[1].size(1) for x in batch]),
159 |                                               dim=0,
160 |                                               descending=True)
161 | 
162 |         max_text_len = max([len(x[0]) for x in batch])
163 |         max_spec_len = max([x[1].size(1) for x in batch])
164 |         max_ying_len = max([x[2].size(1) for x in batch])
165 |         max_wav_len = max([x[3].size(1) for x in batch])
166 | 
167 |         text_lengths = torch.LongTensor(len(batch))
168 |         spec_lengths = torch.LongTensor(len(batch))
169 |         ying_lengths = torch.LongTensor(len(batch))
170 |         wav_lengths = torch.LongTensor(len(batch))
171 |         sid = torch.LongTensor(len(batch))
172 | 
173 |         text_padded = torch.LongTensor(len(batch), max_text_len)
174 |         tone_padded = torch.LongTensor(len(batch), max_text_len)
175 |         spec_padded = torch.FloatTensor(len(batch), batch[0][1].size(0),
176 |                                         max_spec_len)
177 |         ying_padded = torch.FloatTensor(len(batch), batch[0][2].size(0),
178 |                                         max_ying_len)
179 |         wav_padded = torch.FloatTensor(len(batch), 1, max_wav_len)
180 |         text_padded.zero_()
181 |         tone_padded.zero_()
182 |         spec_padded.zero_()
183 |         ying_padded.zero_()
184 |         wav_padded.zero_()
185 |         for i in range(len(ids_sorted_decreasing)):
186 |             row = batch[ids_sorted_decreasing[i]]
187 | 
188 |             text = row[0]
189 |             text_padded[i, :text.size(0)] = text
190 |             text_lengths[i] = text.size(0)
191 | 
192 |             spec = row[1]
193 |             spec_padded[i, :, :spec.size(1)] = spec
194 |             spec_lengths[i] = spec.size(1)
195 | 
196 |             ying = row[2]
197 |             ying_padded[i, :, :ying.size(1)] = ying
198 |             ying_lengths[i] = ying.size(1)
199 | 
200 |             wav = row[3]
201 |             wav_padded[i, :, :wav.size(1)] = wav
202 |             wav_lengths[i] = wav.size(1)
203 | 
204 |             tone = row[5]
205 |             tone_padded[i, :text.size(0)] = tone
206 | 
207 |             sid[i] = row[4]
208 | 
209 |         if self.return_ids:
210 |             return text_padded, text_lengths, spec_padded, spec_lengths, wav_padded, wav_lengths, sid, ids_sorted_decreasing
211 |         return text_padded, text_lengths, spec_padded, spec_lengths, ying_padded, ying_lengths, wav_padded, wav_lengths, sid, tone_padded
212 | 
213 | 
214 | class DistributedBucketSampler(torch.utils.data.distributed.DistributedSampler
215 |                                ):
216 |     """
217 |     Maintain similar input lengths in a batch.
218 |     Length groups are specified by boundaries.
219 |     Ex) boundaries = [b1, b2, b3] -> any batch is included either {x | b1 < length(x) <=b2} or {x | b2 < length(x) <= b3}.
220 | 
221 |     It removes samples which are not included in the boundaries.
222 |     Ex) boundaries = [b1, b2, b3] -> any x s.t. length(x) <= b1 or length(x) > b3 are discarded.
223 |     """
224 | 
225 |     def __init__(self,
226 |                  dataset,
227 |                  batch_size,
228 |                  boundaries,
229 |                  num_replicas=None,
230 |                  rank=None,
231 |                  shuffle=True):
232 |         super().__init__(dataset,
233 |                          num_replicas=num_replicas,
234 |                          rank=rank,
235 |                          shuffle=shuffle)
236 |         self.lengths = dataset.lengths
237 |         self.batch_size = batch_size
238 |         self.boundaries = boundaries
239 | 
240 |         self.buckets, self.num_samples_per_bucket = self._create_buckets()
241 |         self.total_size = sum(self.num_samples_per_bucket)
242 |         self.num_samples = self.total_size // self.num_replicas
243 | 
244 |     def _create_buckets(self):
245 |         buckets = [[] for _ in range(len(self.boundaries) - 1)]
246 |         for i in range(len(self.lengths)):
247 |             length = self.lengths[i]
248 |             idx_bucket = self._bisect(length)
249 |             if idx_bucket != -1:
250 |                 buckets[idx_bucket].append(i)
251 | 
252 |         for i in range(len(buckets) - 1, -1, -1):
253 |             if len(buckets[i]) == 0:
254 |                 buckets.pop(i)
255 |                 self.boundaries.pop(i + 1)
256 | 
257 |         num_samples_per_bucket = []
258 |         for i in range(len(buckets)):
259 |             len_bucket = len(buckets[i])
260 |             total_batch_size = self.num_replicas * self.batch_size
261 |             rem = (total_batch_size -
262 |                    (len_bucket % total_batch_size)) % total_batch_size
263 |             num_samples_per_bucket.append(len_bucket + rem)
264 |         return buckets, num_samples_per_bucket
265 | 
266 |     def __iter__(self):
267 |         # deterministically shuffle based on epoch
268 |         g = torch.Generator()
269 |         g.manual_seed(self.epoch)
270 | 
271 |         indices = []
272 |         if self.shuffle:
273 |             for bucket in self.buckets:
274 |                 indices.append(
275 |                     torch.randperm(len(bucket), generator=g).tolist())
276 |         else:
277 |             for bucket in self.buckets:
278 |                 indices.append(list(range(len(bucket))))
279 | 
280 |         batches = []
281 |         for i in range(len(self.buckets)):
282 |             bucket = self.buckets[i]
283 |             len_bucket = len(bucket)
284 |             ids_bucket = indices[i]
285 |             num_samples_bucket = self.num_samples_per_bucket[i]
286 | 
287 |             # add extra samples to make it evenly divisible
288 |             rem = num_samples_bucket - len_bucket
289 |             ids_bucket = ids_bucket + ids_bucket * \
290 |                 (rem // len_bucket) + ids_bucket[:(rem % len_bucket)]
291 | 
292 |             # subsample
293 |             ids_bucket = ids_bucket[self.rank::self.num_replicas]
294 | 
295 |             # batching
296 |             for j in range(len(ids_bucket) // self.batch_size):
297 |                 batch = [
298 |                     bucket[idx]
299 |                     for idx in ids_bucket[j * self.batch_size:(j + 1) *
300 |                                           self.batch_size]
301 |                 ]
302 |                 batches.append(batch)
303 | 
304 |         if self.shuffle:
305 |             batch_ids = torch.randperm(len(batches), generator=g).tolist()
306 |             batches = [batches[i] for i in batch_ids]
307 |         self.batches = batches
308 | 
309 |         assert len(self.batches) * self.batch_size == self.num_samples
310 |         return iter(self.batches)
311 | 
312 |     def _bisect(self, x, lo=0, hi=None):
313 |         if hi is None:
314 |             hi = len(self.boundaries) - 1
315 | 
316 |         if hi > lo:
317 |             mid = (hi + lo) // 2
318 |             if self.boundaries[mid] < x and x <= self.boundaries[mid + 1]:
319 |                 return mid
320 |             elif x <= self.boundaries[mid]:
321 |                 return self._bisect(x, lo, mid)
322 |             else:
323 |                 return self._bisect(x, mid + 1, hi)
324 |         else:
325 |             return -1
326 | 
327 |     def __len__(self):
328 |         return self.num_samples // self.batch_size
329 | 
330 | ### add ###
331 | import copy
332 | from tqdm import tqdm
333 | ###########
334 | def create_spec(audiopaths_sid_text, hparams):
335 |     audiopaths_sid_text_list = load_filepaths_and_text(audiopaths_sid_text)
336 |     for audiopath, _, _ in tqdm(audiopaths_sid_text_list):          ### add tqdm### 
337 | 
338 |         ### modify ###
339 |         audio_path = copy.deepcopy(audiopath)
340 |         audiopath = os.path.join(hparams.data_path, audiopath)
341 |         try:
342 |             audio, sampling_rate = load_wav_to_torch(audiopath)
343 |         except:
344 |             with open(audiopaths_sid_text, mode="r", encoding="utf-8") as f:
345 |                 lines = f.readlines()
346 |             for idx, line in enumerate(lines):
347 |                 check_path, sentence, spk_id = line.split("|")
348 |                 if check_path == audio_path:
349 |                     remove_path = lines.pop(idx)
350 |             with open(audiopaths_sid_text, mode="w", encoding="utf-8") as f:
351 |                 f.writelines(lines)
352 |             print(f"File is not found!!! path={remove_path}   ==> Removed.")
353 |             continue
354 |         ##############
355 |         
356 |         if sampling_rate != hparams.sampling_rate:
357 |             raise ValueError("{} {} SR doesn't match target {} SR".format(
358 |                 sampling_rate, hparams.sampling_rate))
359 |         audio_norm = audio.unsqueeze(0)
360 |         specpath = audiopath.replace(".wav", ".spec.pt")
361 | 
362 |         if not os.path.exists(specpath):
363 |             spec = spectrogram_torch(audio_norm,
364 |                                      hparams.filter_length,
365 |                                      hparams.sampling_rate,
366 |                                      hparams.hop_length,
367 |                                      hparams.win_length,
368 |                                      center=False)
369 |             spec = torch.squeeze(spec, 0)
370 |             torch.save(spec, specpath)
371 | 
372 | 
373 | ### add ###
374 | def dataset_check(audiopaths_sid_text, hparams):
375 |     audiopaths_sid_text_list = load_filepaths_and_text(audiopaths_sid_text)
376 |     for audiopath, _, _ in tqdm(audiopaths_sid_text_list):          ### add tqdm### 
377 | 
378 |         ### modify ###
379 |         audio_path = copy.deepcopy(audiopath)
380 |         audiopath = os.path.join(hparams.data_path, audiopath)
381 |         try:
382 |             audio, sampling_rate = load_wav_to_torch(audiopath)
383 |         except:
384 |             with open(audiopaths_sid_text, mode="r", encoding="utf-8") as f:
385 |                 lines = f.readlines()
386 |             for idx, line in enumerate(lines):
387 |                 check_path, sentence, spk_id = line.split("|")
388 |                 if check_path == audio_path:
389 |                     remove_path = lines.pop(idx)
390 |             with open(audiopaths_sid_text, mode="w", encoding="utf-8") as f:
391 |                 f.writelines(lines)
392 |             print(f"File is not found!!! path={remove_path}   ==> Removed.")
393 |             continue
394 |         ##############
395 |         
396 |         if sampling_rate != hparams.sampling_rate:
397 |             raise ValueError("{} {} SR doesn't match target {} SR".format(
398 |                 sampling_rate, hparams.sampling_rate))
399 | 
400 | 
401 | ### add ###
402 | def infer_text_process(text, hps):
403 | 
404 |     lang = hps.data.languages
405 |     add_blank = hps.data.add_blank
406 |     
407 |     # get_text function
408 |     text_norm, tone = text_to_sequence(text, lang)
409 |     if add_blank:
410 |         text_norm = commons.intersperse(text_norm, 0)
411 |         tone = commons.intersperse(tone, 0)
412 |     
413 |     text_length = torch.tensor(len(text_norm), dtype=torch.int64)
414 |     text_length = torch.unsqueeze(torch.tensor(len(text_norm), dtype=torch.int64),dim=0)
415 |     text_norm = torch.unsqueeze(torch.LongTensor(text_norm), dim=0)
416 |     tone = torch.unsqueeze(torch.LongTensor(tone), dim=0)
417 | 
418 | 
419 |     return text_norm, tone, text_length
420 | ###########


--------------------------------------------------------------------------------
/dataset/preprocess.py:
--------------------------------------------------------------------------------
 1 | import librosa
 2 | import os
 3 | import soundfile
 4 | from tqdm import tqdm
 5 | import random
 6 | import argparse
 7 |  
 8 | def main(dataset_dir:str = "./jvs_ver1/", target_sr:int = 22050):
 9 |     
10 |     use_wav_folder = ["parallel100", "nonpara30"] #, "whisper10","falset10"]
11 |     text_list = list()
12 | 
13 |     file_count = 0
14 |     sentence_count = 0
15 |     for spk_idx in range(100):
16 |         spk_idx += 1
17 |         spk_name = "jvs" + str(spk_idx).zfill(3)
18 |         for folder in use_wav_folder:
19 |             target_folder = os.path.join(dataset_dir, spk_name, folder, "wav24kHz16bit")
20 |             results_folder = os.path.join(dataset_dir, spk_name, folder, f"wav{target_sr}Hz16bit")
21 |             os.makedirs(results_folder, exist_ok=True)
22 |             
23 |             for filename in tqdm(os.listdir(target_folder), desc=spk_name):
24 |                 wav_path = os.path.join(target_folder, filename)
25 |                 y, sr = librosa.load(wav_path, sr=24000)
26 |                 y_converted = librosa.resample(y, orig_sr=24000, target_sr=target_sr)
27 |                 save_path = os.path.join(results_folder, filename)
28 |                 soundfile.write(save_path, y_converted, target_sr)
29 |                 
30 |             txt_path = os.path.join(dataset_dir, spk_name, folder, "transcripts_utf8.txt")
31 |             for txt in read_txt(txt_path):
32 |                 if txt == "\n":
33 |                     continue
34 |                 name, sentence = txt.split(":")
35 |                 sentence = sentence.replace("\n", "")
36 |                 wav_filepath = os.path.join(spk_name, folder, f"wav{target_sr}Hz16bit", name+".wav")
37 | 
38 |                 out_txt = wav_filepath + "|" + sentence + "|" + spk_name + "\n"
39 |                 text_list.append(out_txt)
40 | 
41 |     max_n = len(text_list)
42 |     test_list = list()
43 |     for _ in range(int(max_n * 0.005)):
44 |         n = len(text_list)
45 |         idx = random.randint(9, int(n-1))
46 |         txt = text_list.pop(idx)
47 |         test_list.append(txt)
48 | 
49 |         
50 |     max_n = len(text_list)
51 |     val_list = list()
52 |     for _ in range(int(max_n * 0.005)):
53 |         n = len(text_list)
54 |         idx = random.randint(9, int(n-1))
55 |         txt = text_list.pop(idx)
56 |         val_list.append(txt)
57 | 
58 |     write_txt(f"./filelists/jvs_train_{target_sr}.txt", text_list)
59 |     write_txt(f"./filelists/jvs_val_{target_sr}.txt", val_list)
60 |     write_txt(f"./filelists/jvs_test_{target_sr}.txt", test_list)
61 | 
62 |     
63 |     return 0
64 | 
65 | def read_txt(path):
66 |     with open(path, mode="r", encoding="utf-8")as f:
67 |         lines = f.readlines()
68 |     return lines
69 | 
70 | def write_txt(path, lines):
71 |     with open(path, mode="w", encoding="utf-8")as f:
72 |         f.writelines(lines)
73 | 
74 | 
75 | if __name__ == "__main__":
76 | 
77 |     parser = argparse.ArgumentParser()
78 | 
79 |     parser.add_argument('--folder_path',
80 |                         type=str,
81 |                         required=True, 
82 |                         help='Path to jvs corpus folder')
83 |     parser.add_argument('--sampling_rate',
84 |                         type=str,
85 |                         required=True, 
86 |                         help='Target sampling rate')
87 | 
88 |     args = parser.parse_args()
89 | 
90 |     main(dataset_dir=args.folder_path, target_sr=int(args.sampling_rate))


--------------------------------------------------------------------------------
/filelists/vctk_val_g2p.txt:
--------------------------------------------------------------------------------
  1 | wav22_silence_trimmed_wav/p225/p225_357_mic1.wav|{IH T} {IH Z} {AH} {S AY N} {AH V} {HH OW P}.|p225
  2 | wav22_silence_trimmed_wav/p225/p225_358_mic1.wav|{DH AH} {K AH M IH T M AH N T} {W AA Z} {N AA T} {L AO NG}, {AE N D} {IH T} {W AA Z} {W ER TH} {DH AH} {R IH S K}.|p225
  3 | wav22_silence_trimmed_wav/p225/p225_359_mic1.wav|{B AH T} {W IY} {W EH L K AH M} {DH IH S} {D AA K Y AH M EH N T}.|p225
  4 | wav22_silence_trimmed_wav/p226/p226_365_mic1.wav|{AY} {HH AE V} {G AA T} {AH} {W AY F} {T UW} {F IY D}.|p226
  5 | wav22_silence_trimmed_wav/p226/p226_366_mic1.wav|{IH T S} {N AA T} {P R IH T IY}, {B AH T} {IH T S} {IH F EH K T IH V}.|p226
  6 | wav22_silence_trimmed_wav/p226/p226_367_mic1.wav|{EH K W AH T IY} {D IH K L AY N D} {T UW} {K AA M EH N T}.|p226
  7 | wav22_silence_trimmed_wav/p227/p227_397_mic1.wav|{DH EY} {HH AE V} {N AA T} {G AA T} {EH N IY W AH N}.|p227
  8 | wav22_silence_trimmed_wav/p227/p227_398_mic1.wav|{AY} {K AE N} {HH AA R D L IY} {B IH L IY V} {IH T}.|p227
  9 | wav22_silence_trimmed_wav/p227/p227_399_mic1.wav|{HH AW} {D UW} {Y UW} {T EY K} {DH EH M} {AH W EY}?|p227
 10 | wav22_silence_trimmed_wav/p228/p228_366_mic1.wav|{B AH T} {AY} {M AE N AH JH D}.|p228
 11 | wav22_silence_trimmed_wav/p228/p228_367_mic1.wav|{B AY} {DH AE T} {T AY M}, {HH AW EH V ER}, {IH T} {W AA Z} {AO L R EH D IY} {T UW} {L EY T}.|p228
 12 | wav22_silence_trimmed_wav/p228/p228_368_mic1.wav|{AY} {AE M} {N AA T} {W IH L IH NG} {T UW} {S EY} {EH N IY TH IH NG} {AH B AW T} {EH N IY} {K AH P AH L}.|p228
 13 | wav22_silence_trimmed_wav/p229/p229_387_mic1.wav|{DH AE T} {T IY M} {IH Z} {D UW} {T UW} {B IY} {AH N AW N S T} {DH IH S} {M AO R N IH NG}.|p229
 14 | wav22_silence_trimmed_wav/p229/p229_388_mic1.wav|{IH T} {HH AE Z} {B IH N} {AH} {L AH V L IY} {F AE M AH L IY} {AH K EY ZH AH N}.|p229
 15 | wav22_silence_trimmed_wav/p229/p229_389_mic1.wav|{DH AE T S} {DH AH} {N EY M} {AH V} {DH AH} {G EY M}, {DH OW}.|p229
 16 | wav22_silence_trimmed_wav/p230/p230_411_mic1.wav|{T R IY T M AH N T} {IH Z} {N AA T} {AH N} {IH SH UW} {W IH DH} {DH IY Z} {P IY P AH L}.|p230
 17 | wav22_silence_trimmed_wav/p230/p230_413_mic1.wav|{IY V IH N} {IH F} {DH EY} {K AH M} {AW T} {P L EY IH NG} {AH} {F IH Z IH K AH L} {G EY M}, {W IY} {K AE N} {K OW P}.|p230
 18 | wav22_silence_trimmed_wav/p230/p230_414_mic1.wav|{IH T} {IH Z} {DH AH} {OW L D} {S T AO R IY}.|p230
 19 | wav22_silence_trimmed_wav/p231/p231_471_mic1.wav|{D EH N IH S} {W AA Z} {N AA T} {S OW} {SH UH R}.|p231
 20 | wav22_silence_trimmed_wav/p231/p231_472_mic1.wav|{HH IY} {F EH L T} {IH T} {W AA Z} {DH AH} {R AY T} {T AY M}.|p231
 21 | wav22_silence_trimmed_wav/p231/p231_473_mic1.wav|{IH T} {W AA Z} {K L IH R} {AA N} {TH ER Z D EY}.|p231
 22 | wav22_silence_trimmed_wav/p232/p232_410_mic1.wav|{AY} {W AA Z} {IH N} {AH} {P AH Z IH SH AH N} {T UW} {CH AE L AH N JH} {F AO R} {DH IH S} {IH V EH N T} {AE N D} {D IH D AH N T}.|p232
 23 | wav22_silence_trimmed_wav/p232/p232_411_mic1.wav|{Y UW} {W AO N T IH D} {DH AH} {EH V AH D AH N S}.|p232
 24 | wav22_silence_trimmed_wav/p232/p232_412_mic1.wav|{S IY N Y ER} {M AE N AH JH M AH N T} {IH N} {S K AA T L AH N D} {TH R UW} {IH T S} {W EY T} {B IH HH AY N D} {DH AH} {AO R K AH S T R AH}.|p232
 25 | wav22_silence_trimmed_wav/p233/p233_387_mic1.wav|{IH T} {W AA Z} {AH} {B R EH TH T EY K IH NG} {M OW M AH N T}.|p233
 26 | wav22_silence_trimmed_wav/p233/p233_388_mic1.wav|{AY} {W AA Z} {IH N} {JH EY L} {F AO R} {F AY V} {Y IH R Z}.|p233
 27 | wav22_silence_trimmed_wav/p233/p233_389_mic1.wav|{W AH T} {HH AE P AH N D} {IH N} {DH AH} {S AH M ER}?|p233
 28 | wav22_silence_trimmed_wav/p234/p234_356_mic1.wav|{DH AE T} {IH Z} {DH AH} {L EH S AH N} {AH V} {DH AH} {L AE S T} {TH R IY} {W IY K S}.|p234
 29 | wav22_silence_trimmed_wav/p234/p234_357_mic1.wav|{DH AH} {S IH T IY} {W EH L K AH M D} {DH AH} {N AH M B ER Z}.|p234
 30 | wav22_silence_trimmed_wav/p234/p234_358_mic1.wav|{S AH M TH IH NG} {IH Z} {G OW IH NG} {T UW} {HH AE V} {T UW} {G IH V}.|p234
 31 | wav22_silence_trimmed_wav/p236/p236_498_mic1.wav|{AY} {W AA Z} {R EH D IY} {F AO R} {DH IH S}.|p236
 32 | wav22_silence_trimmed_wav/p236/p236_499_mic1.wav|{IH N} {F AE K T}, {W IY} {W ER} {AO L} {OW V ER} {DH AH} {SH AA P}.|p236
 33 | wav22_silence_trimmed_wav/p236/p236_500_mic1.wav|{AH} {S K AA T S M AH N} {HH AE Z} {T UW} {D IH F EH N D} {HH IH Z} {K AE S AH L}.|p236
 34 | wav22_silence_trimmed_wav/p237/p237_346_mic1.wav|{DH AH} {AH N AW N S M EH N T} {W AA Z} {M EY D} {AE F T ER} {IH N K W AY ER IY Z} {F R AH M} {AH} {N AE SH AH N AH L} {N UW Z P EY P ER}.|p237
 35 | wav22_silence_trimmed_wav/p237/p237_347_mic1.wav|{IH N D IY D}, {IH T} {W AA Z} {M AE JH IH K AH L}.|p237
 36 | wav22_silence_trimmed_wav/p237/p237_348_mic1.wav|{L UH K} {AE T} {DH AH} {W IH T N AH S AH Z}.|p237
 37 | wav22_silence_trimmed_wav/p238/p238_454_mic1.wav|{S IH M AH L ER} {M EH ZH ER Z} {AA R} {IH K S P EH K T AH D} {IH N} {IH NG G L AH N D} {AE N D} {W EY L Z}.|p238
 38 | wav22_silence_trimmed_wav/p238/p238_455_mic1.wav|{S T EY JH K OW CH} {IH Z} {S IY K IH NG} {AH} {N UW} {F AH N AE N S} {D ER EH K T ER}.|p238
 39 | wav22_silence_trimmed_wav/p238/p238_456_mic1.wav|{N AW} {IH T} {HH AE Z} {D AH B AH L D} {IH N} {S AY Z}.|p238
 40 | wav22_silence_trimmed_wav/p239/p239_498_mic1.wav|{S AE T ER D IY Z} {M AE CH} {W AA Z} {F EH R L IY} {S T R EY T F AO R W ER D}.|p239
 41 | wav22_silence_trimmed_wav/p239/p239_499_mic1.wav|{AY} {AE M} {N AA T} {AH N D UW L IY} {S ER P R AY Z D}.|p239
 42 | wav22_silence_trimmed_wav/p239/p239_500_mic1.wav|{AY AH N} {D AH NG K AH N} {S M IH TH} {IH Z} {R AO NG}.|p239
 43 | wav22_silence_trimmed_wav/p240/p240_375_mic1.wav|{W IY} {AA R} {L UH K IH NG} {F AO R} {P ER F EH K SH AH N}.|p240
 44 | wav22_silence_trimmed_wav/p240/p240_376_mic1.wav|{AA N} {F Y UW AH L}, {DH AH} {CH AE N S AH L ER} {HH AE Z} {AH} {N AH M B ER} {AH V} {AA P SH AH N Z}.|p240
 45 | wav22_silence_trimmed_wav/p240/p240_377_mic1.wav|{HH IY} {L EY T ER} {B IH K EY M} {AH} {R IH S P EH K T IH D} {HH AY} {K AO R T} {JH AH JH}.|p240
 46 | wav22_silence_trimmed_wav/p241/p241_369_mic1.wav|{DH EH R} {W AA Z} {N OW} {HH IH N T} {AH V} {S K AE N D AH L}.|p241
 47 | wav22_silence_trimmed_wav/p241/p241_370_mic1.wav|{S EY F T IY} {W AA Z} {AO L S OW} {AH N} {IH SH UW}.|p241
 48 | wav22_silence_trimmed_wav/p241/p241_371_mic1.wav|{W EH L}, {DH AH} {B IH G} {D EY} {HH AE Z} {ER AY V D}.|p241
 49 | wav22_silence_trimmed_wav/p243/p243_393_mic1.wav|{B AH T} {S UW N} {DH AH} {G EY M} {D R AA P T} {B AE K} {IH N T UW} {IH T S} {OW L D} {W EY Z}.|p243
 50 | wav22_silence_trimmed_wav/p243/p243_394_mic1.wav|{IH N} {M AY} {AH P IH N Y AH N}, {EH N IY} {T R AE N S F ER} {T UW} {AH} {F AO R AH N} {K L AH B} {W IH L} {HH EH L P} {HH IH M} {P R AA G R AH S}.|p243
 51 | wav22_silence_trimmed_wav/p243/p243_395_mic1.wav|{IH N} {DH AE T} {S IH CH UW EY SH AH N}, {DH EY} {TH AO T} {W IY} {SH UH D} {B IY} {B IY T AH N}.|p243
 52 | wav22_silence_trimmed_wav/p244/p244_419_mic1.wav|{B IH L D IH NG} {AH N} {AH N D ER G R AW N D}, {F AO R} {IH G Z AE M P AH L}, {HH AE Z} {P R UW V D} {AH} {N AY T M EH R}.|p244
 53 | wav22_silence_trimmed_wav/p244/p244_420_mic1.wav|{DH AE T} {M AY T} {M IY N} {AH N AH DH ER} {D IH L EY}.|p244
 54 | wav22_silence_trimmed_wav/p244/p244_421_mic1.wav|{W AH N} {D EY}, {HH IY} {TH R UW} {DH AH} {B EY B IY} {AH G EH N S T} {AH} {W AO L}.|p244
 55 | wav22_silence_trimmed_wav/p245/p245_354_mic1.wav|{IH T} {W AA Z} {V EH R IY} {F AO R M AH L}.|p245
 56 | wav22_silence_trimmed_wav/p245/p245_355_mic1.wav|{DH AH} {P R EH SH ER} {IH Z} {AA N} {S EH L T IH K} {IH N} {DH AH} {M EY N}.|p245
 57 | wav22_silence_trimmed_wav/p245/p245_356_mic1.wav|{IH T S} {N AA T} {DH AE T} {K L IH R}-{K AH T}.|p245
 58 | wav22_silence_trimmed_wav/p246/p246_355_mic1.wav|{IH T} {IH Z} {AH} {M AH M AO R IY AH L}.|p246
 59 | wav22_silence_trimmed_wav/p246/p246_357_mic1.wav|{HH IY} {IH Z} {V EH R IY} {K AH N S ER N D} {AH B AW T} {HH IH Z} {F IH T N AH S} {AE Z} {AH} {F UH T B AO L ER}.|p246
 60 | wav22_silence_trimmed_wav/p246/p246_358_mic1.wav|{DH EH R} {T AY M} {HH AE Z} {G AO N}.|p246
 61 | wav22_silence_trimmed_wav/p247/p247_473_mic1.wav|{W IY} {W IH L} {N AW} {L UH K} {IH N T UW} {IH T S} {HH IH S T AO R IH K AH L} {B AE K G R AW N D}.|p247
 62 | wav22_silence_trimmed_wav/p247/p247_474_mic1.wav|{DH EY} {W ER} {IH M P R EH S IH V} {AH G EH N S T} {F R AE N S}.|p247
 63 | wav22_silence_trimmed_wav/p247/p247_475_mic1.wav|{DH AH} {CH IY F} {K AA N S T AH B AH L} {HH AE Z} {R IH T AY R D}.|p247
 64 | wav22_silence_trimmed_wav/p248/p248_371_mic1.wav|{W EH DH ER} {HH IH Z} {S T AE N S} {IH Z} {SH EH R D} {B AY} {DH AH} {IH N K AH M IH NG} {M AE N AH JH ER} {IH Z} {AH N AH DH ER} {M AE T ER}.|p248
 65 | wav22_silence_trimmed_wav/p248/p248_372_mic1.wav|{K AH L ER} {W AA Z} {AE T} {DH AH} {K AO R} {AH V} {HH IH Z} {L AY F}.|p248
 66 | wav22_silence_trimmed_wav/p248/p248_373_mic1.wav|{HH IY Z} {D IH L AY T AH D}, {T UW}, {W IH DH} {DH AH} {N UW} {P R EH M AH S AH Z}.|p248
 67 | wav22_silence_trimmed_wav/p249/p249_349_mic1.wav|{DH AH} {D IH S IH ZH AH N} {IH Z} {AH N} {AE B S AH L UW T} {D IH S G R EY S}.|p249
 68 | wav22_silence_trimmed_wav/p249/p249_350_mic1.wav|{IH T} {IH Z} {HH AO R AH B AH L}.|p249
 69 | wav22_silence_trimmed_wav/p249/p249_351_mic1.wav|{IH T} {W AA Z} {V EH R IY} {F AO R M AH L}.|p249
 70 | wav22_silence_trimmed_wav/p250/p250_491_mic1.wav|{AY} {T IY} {EY CH} {EY} {EH S} {B IH N} {DH AH} {Y IH R} {AH V} {DH AH} {Y AH NG S T ER} {AE T} {K IH L M AA R N AH K}.|p250
 71 | wav22_silence_trimmed_wav/p250/p250_492_mic1.wav|{DH IY Z} {P IY P AH L} {AH T AE K} {DH AH} {K AO R} {AH V} {M AY} {B IH L IY F S}.|p250
 72 | wav22_silence_trimmed_wav/p250/p250_493_mic1.wav|{DH AH} {S T R AY K ER} {N IY D AH D} {AH T EH N SH AH N} {B IH F AO R} {HH IY} {K UH D} {R IH Z UW M}.|p250
 73 | wav22_silence_trimmed_wav/p251/p251_367_mic1.wav|{HH IY} {HH AE Z} {S IY N} {DH AH} {P AE S T}.|p251
 74 | wav22_silence_trimmed_wav/p251/p251_368_mic1.wav|{W IY} {W IH L} {B IY} {HH OW M L AH S}.|p251
 75 | wav22_silence_trimmed_wav/p251/p251_369_mic1.wav|{DH AE T} {W UH D} {HH EH L P}.|p251
 76 | wav22_silence_trimmed_wav/p252/p252_406_mic1.wav|{D AH Z} {IH T} {M AE T ER}?|p252
 77 | wav22_silence_trimmed_wav/p252/p252_407_mic1.wav|{DH AH} {D AA K T ER Z} {AA R} {K W AY T} {P AA Z AH T IH V} {AH B AW T} {M AY} {P R AA G R AH S}.|p252
 78 | wav22_silence_trimmed_wav/p252/p252_408_mic1.wav|{DH EY} {HH AE D} {T UW} {HH AE V} {HH AA S P IH T AH L} {T R IY T M AH N T}.|p252
 79 | wav22_silence_trimmed_wav/p253/p253_404_mic1.wav|{EH R Z} {TH AE CH ER}{AA Z} {IH N} {DH AH} {R AY T} {P L EY S}, {AE T} {DH AH} {R AY T} {T AY M}.|p253
 80 | wav22_silence_trimmed_wav/p253/p253_405_mic1.wav|{AH N T IH L} {DH EY} {K EY M} {T UW} {D UW} {IH T}.|p253
 81 | wav22_silence_trimmed_wav/p253/p253_407_mic1.wav|{DH AE T} {M EY} {B IY}.|p253
 82 | wav22_silence_trimmed_wav/p254/p254_398_mic1.wav|{AH R IH K S AH N} {W UH D} {HH AE V} {AH P R UW V D}.|p254
 83 | wav22_silence_trimmed_wav/p254/p254_399_mic1.wav|{P Y UW P AH L Z} {W ER} {AH L AW D} {HH OW M} {AE T} {L AH N CH T AY M}.|p254
 84 | wav22_silence_trimmed_wav/p254/p254_400_mic1.wav|{IH T} {W AA Z} {AH} {B R EH TH T EY K IH NG} {M OW M AH N T}.|p254
 85 | wav22_silence_trimmed_wav/p255/p255_376_mic1.wav|{K AH L ER} {W AA Z} {AE T} {DH AH} {K AO R} {AH V} {HH IH Z} {L AY F}.|p255
 86 | wav22_silence_trimmed_wav/p255/p255_377_mic1.wav|{HH IY Z} {D IH L AY T AH D}, {T UW}, {W IH DH} {DH AH} {N UW} {P R EH M AH S AH Z}.|p255
 87 | wav22_silence_trimmed_wav/p255/p255_378_mic1.wav|{IH T} {W AA Z} {AE Z} {IH F} {IH T} {W AA Z} {AO L} {HH AE P AH N IH NG} {AE T} {AH} {G AA R D AH N} {P AA R T IY}.|p255
 88 | wav22_silence_trimmed_wav/p256/p256_315_mic1.wav|{DH EY} {W ER} {F AE N T AE S T IH K}.|p256
 89 | wav22_silence_trimmed_wav/p256/p256_316_mic1.wav|{DH EH R} {IH Z} {N OW} {AH DH ER} {S AH L UW SH AH N} {T UW} {K AH N JH EH S CH AH N} {IH N} {EH D AH N B ER OW}.|p256
 90 | wav22_silence_trimmed_wav/p256/p256_317_mic1.wav|{SH IY} {IH Z} {V EH R IY} {D IH S T IH NG K T IH V}.|p256
 91 | wav22_silence_trimmed_wav/p257/p257_429_mic1.wav|{IH T S} {AH} {M IH R AH K AH L}.|p257
 92 | wav22_silence_trimmed_wav/p257/p257_430_mic1.wav|{DH AH} {F ER S T} {M IH N AH S T ER} {IH Z} {AA B V IY AH S L IY} {K AH N S ER N D} {T UW} {HH IY R} {AH B AW T} {DH IH S} {IH N S AH D AH N T}.|p257
 93 | wav22_silence_trimmed_wav/p257/p257_431_mic1.wav|{DH AE T} {W AA Z} {DH AH} {F ER S T} {T AY M} {AY} {W ER K T} {W IH DH} {R IH CH ER D}.|p257
 94 | wav22_silence_trimmed_wav/p258/p258_409_mic1.wav|{DH AH} {V IH ZH W AH L} {AA R T S} {K AH M IH T IY} {T UH K} {DH IH S} {D IH S IH ZH AH N} {IH N} {D IH S EH M B ER}.|p258
 95 | wav22_silence_trimmed_wav/p258/p258_410_mic1.wav|{AH DH ER} {M EH M B ER Z} {AH V} {DH AH} {F AE M AH L IY} {W ER} {T UW} {AH P S EH T} {T UW} {K AA M EH N T} {L AE S T} {N AY T}.|p258
 96 | wav22_silence_trimmed_wav/p258/p258_411_mic1.wav|{AY} {B IH L IY V} {DH AH} {G AE P} {IH Z} {N IH R L IY} {DH EH R}.|p258
 97 | wav22_silence_trimmed_wav/p259/p259_476_mic1.wav|{HH IY} {D IH Z ER V D} {AH} {R EH D} {K AA R D}.|p259
 98 | wav22_silence_trimmed_wav/p259/p259_477_mic1.wav|{DH AH} {P R AY S} {P EY D} {F AO R} {IH T} {W AA Z} {N AA T} {D IH S K L OW Z D}.|p259
 99 | wav22_silence_trimmed_wav/p259/p259_478_mic1.wav|{Y UW} {HH AE V} {T UW} {HH AE V} {AH} {P R UW V AH N} {T R AE K} {R EH K ER D} {T UW} {G EH T} {DH EH M}.|p259
100 | wav22_silence_trimmed_wav/p260/p260_352_mic1.wav|{DH EY} {SH UH D} {T EY K} {DH EH R} {M OW B AH L} {F OW N Z}.|p260
101 | wav22_silence_trimmed_wav/p260/p260_353_mic1.wav|{EY T} {M AH N TH S} {L EY T ER}, {HH IY} {W AA Z} {D EH D}.|p260
102 | wav22_silence_trimmed_wav/p260/p260_354_mic1.wav|{AY} {EY CH} {EY} {V IY} {IY} {AH} {D R IY M}.|p260
103 | wav22_silence_trimmed_wav/p261/p261_469_mic1.wav|{DH EY} {AA R} {D IH F AY N D} {B AY} {L AO}.|p261
104 | wav22_silence_trimmed_wav/p261/p261_470_mic1.wav|{IH F} {DH AE T S} {DH AH} {K EY S}, {HH IY} {W IH L} {S T R AH G AH L}.|p261
105 | wav22_silence_trimmed_wav/p261/p261_471_mic1.wav|{AY M} {V EH R IY} {P L IY Z D} {F AO R} {DH AH} {K L AH B} {AE N D} {F AO R} {M AY S EH L F}.|p261
106 | wav22_silence_trimmed_wav/p262/p262_389_mic1.wav|{IH T} {W AA Z} {AH} {B R EH TH T EY K IH NG} {M OW M AH N T}.|p262
107 | wav22_silence_trimmed_wav/p262/p262_390_mic1.wav|{IH T} {W AA Z} {AA N} {F AY ER}.|p262
108 | wav22_silence_trimmed_wav/p262/p262_391_mic1.wav|{DH IH S} {IH Z} {AH} {M EY JH ER} {S T EH P} {F AO R W ER D} {F AO R} {K EH R ER Z}.|p262
109 | wav22_silence_trimmed_wav/p263/p263_468_mic1.wav|{IH T S} {B IH N} {AH} {L AO NG}, {L AO NG} {JH ER N IY}.|p263
110 | wav22_silence_trimmed_wav/p263/p263_469_mic1.wav|{AY} {JH AH S T} {G AA T} {AH N D ER} {DH AH} {B AO L} {AH} {B IH T}.|p263
111 | wav22_silence_trimmed_wav/p263/p263_470_mic1.wav|{DH EH R} {IH Z} {AH} {S IH M AH L ER} {S T AO R IY} {F AO R} {M IH L K} {AE N D} {D EH R IY} {P R AA D AH K T S}.|p263
112 | wav22_silence_trimmed_wav/p264/p264_490_mic1.wav|{W IY V} {G AA T} {G UH D} {AA P SH AH N Z}.|p264
113 | wav22_silence_trimmed_wav/p264/p264_491_mic1.wav|{DH IH S} {AA P ER EY SH AH N} {W IH L} {CH EY N JH} {HH ER} {L AY F}.|p264
114 | wav22_silence_trimmed_wav/p264/p264_492_mic1.wav|{N UW} {Y AO R K} {IH Z} {M AY} {HH OW M}.|p264
115 | wav22_silence_trimmed_wav/p265/p265_347_mic1.wav|{DH AH} {F AY N AH L} {D IH S IH ZH AH N} {W AA Z} {B IH T W IY N} {S K AA T L AH N D} {AE N D} {DH AH} {R IY P AH B L AH K} {AH V} {AY ER L AH N D}.|p265
116 | wav22_silence_trimmed_wav/p265/p265_348_mic1.wav|{AH} {R IY V Y UW} {IH Z} {AH N D ER} {W EY}.|p265
117 | wav22_silence_trimmed_wav/p265/p265_349_mic1.wav|{S EY F T IY} {W AA Z} {AO L S OW} {AH N} {IH SH UW}.|p265
118 | wav22_silence_trimmed_wav/p266/p266_419_mic1.wav|{IH T} {IH Z} {L IY G AH L IY} {B AY N D IH NG}.|p266
119 | wav22_silence_trimmed_wav/p266/p266_420_mic1.wav|{P EH R IH S} {IH Z} {N AA T} {OW N L IY} {Y AH NG}.|p266
120 | wav22_silence_trimmed_wav/p266/p266_421_mic1.wav|{S AH CH} {AE Z} {IH T} {IH Z}.|p266
121 | wav22_silence_trimmed_wav/p267/p267_416_mic1.wav|{B AY} {DH EH N}, {AH} {M AE S IH V} {L IY G AH L} {B AE T AH L} {IH Z} {L AY K L IY} {T UW} {HH AE V} {S T AA R T AH D}.|p267
122 | wav22_silence_trimmed_wav/p267/p267_417_mic1.wav|{DH IH S} {IH Z} {N OW} {R AH F L EH K SH AH N} {AA N} {R EY N JH ER Z}.|p267
123 | wav22_silence_trimmed_wav/p267/p267_418_mic1.wav|{F ER DH ER} {N UW} {EH K W AH T IY} {W AA Z} {R EY Z D} {IH N} {AH} {P L EY S IH NG} {IH N} {JH AE N Y UW EH R IY}.|p267
124 | wav22_silence_trimmed_wav/p268/p268_406_mic1.wav|{W AA SH IH NG T AH N} {IH Z} {K AH N S UW M D} {B AY} {DH AH} {K R AY S AH S}.|p268
125 | wav22_silence_trimmed_wav/p268/p268_407_mic1.wav|{N AW}, {S AH D AH N L IY}, {W IY} {HH AE V} {DH IH S} {N UW} {L AE N D S K EY P}.|p268
126 | wav22_silence_trimmed_wav/p268/p268_408_mic1.wav|{AY} {SH UH D} {TH IH NG K} {S OW}, {T UW}.|p268
127 | wav22_silence_trimmed_wav/p269/p269_398_mic1.wav|{AH P AA R T} {F R AH M} {DH AH} {R IH Z AH L T} {W IY} {HH AE V} {T UW} {B IY} {HH AE P IY} {W IH DH} {AW ER} {P ER F AO R M AH N S}.|p269
128 | wav22_silence_trimmed_wav/p269/p269_399_mic1.wav|{W EH L}, {IH T} {D IH D} {L AE S T} {T AY M}, {HH IY} {W AA Z} {R IY M AY N D AH D}.|p269
129 | wav22_silence_trimmed_wav/p269/p269_400_mic1.wav|{B AY} {DH EH N}, {AH} {M AE S IH V} {L IY G AH L} {B AE T AH L} {IH Z} {L AY K L IY} {T UW} {HH AE V} {S T AA R T AH D}.|p269
130 | wav22_silence_trimmed_wav/p270/p270_457_mic1.wav|{W IY} {W IH L} {D IY L} {W IH DH} {DH AH} {R EH F Y UW JH IY Z}.|p270
131 | wav22_silence_trimmed_wav/p270/p270_458_mic1.wav|{IH F} {DH AE T S} {DH AH} {K EY S}, {HH IY} {W IH L} {S T R AH G AH L}.|p270
132 | wav22_silence_trimmed_wav/p270/p270_459_mic1.wav|{W IY} {AA R} {OW V ER} {DH AH} {M UW N}.|p270
133 | wav22_silence_trimmed_wav/p271/p271_449_mic1.wav|{IH T} {W IH L} {D AH T ER M AH N} {W EH DH ER} {AH N} {AH F EH N S} {HH AE Z} {AH K ER D}.|p271
134 | wav22_silence_trimmed_wav/p271/p271_450_mic1.wav|{W AA Z} {IH T} {W ER TH} {IH T}?|p271
135 | wav22_silence_trimmed_wav/p271/p271_451_mic1.wav|{IH T} {W AA Z} {IH G N AO R D}.|p271
136 | wav22_silence_trimmed_wav/p272/p272_407_mic1.wav|{HH AW EH V ER}, {DH AH} {P L EY ER Z} {SH UH D} {HH AE V} {AH} {V OY S} {IH N} {DH IY Z} {M AE T ER Z}.|p272
137 | wav22_silence_trimmed_wav/p272/p272_408_mic1.wav|{HH IY} {HH AE Z} {R IH T AH N} {T UW} {DH AH} {M IH N AH S T ER} {AE F T ER} {M IY T IH NG Z} {AA N} {DH AH} {AY L AH N D}.|p272
138 | wav22_silence_trimmed_wav/p272/p272_409_mic1.wav|{B AH T} {AY} {F EH L T} {IH T} {W AA Z} {IH M P AO R T AH N T} {T UW} {IH N T R AH D UW S} {DH AH} {EH L AH M AH N T} {AH V} {T R AH D IH SH AH N}.|p272
139 | wav22_silence_trimmed_wav/p273/p273_431_mic1.wav|{DH EY} {AA R} {N AA T} {L EH F T} {W IH NG}.|p273
140 | wav22_silence_trimmed_wav/p273/p273_432_mic1.wav|{IH T} {IH Z} {N AA T} {AH} {S T AE N D IH NG} {AA R M IY}.|p273
141 | wav22_silence_trimmed_wav/p273/p273_433_mic1.wav|{S T R EY N JH L IY} {IH N AH F} {AY} {F EH L T} {V EH R IY} {SH AA R P}.|p273
142 | wav22_silence_trimmed_wav/p274/p274_465_mic1.wav|{DH AH} {EH D AH N B ER OW} {AA D IY AH N S} {W AA Z} {EY B AH L} {T UW} {AH N D ER S T AE N D} {DH AH} {D AY AH L AO G}.|p274
143 | wav22_silence_trimmed_wav/p274/p274_466_mic1.wav|{F IH L} {M IH K EH L S AH N} {D IH D} {DH AE T} {L AE S T} {Y IH R}.|p274
144 | wav22_silence_trimmed_wav/p274/p274_467_mic1.wav|{DH AH} {AE N S ER} {W AA Z} {AH P} {DH EH R} {AA N} {DH AH} {S T EY JH}.|p274
145 | wav22_silence_trimmed_wav/p275/p275_424_mic1.wav|{W IY} {AA R} {N AW} {AH P} {AH G EH N S T} {IH T}.|p275
146 | wav22_silence_trimmed_wav/p275/p275_425_mic1.wav|{L AE S T} {Y IH R}, {IH T} {W AA Z} {W AH N} {B AY} {JH AE K} {M AH K AA N AH L}, {DH AH} {F ER S T} {M IH N AH S T ER}.|p275
147 | wav22_silence_trimmed_wav/p275/p275_426_mic1.wav|{AY} {W AA N T} {AH} {K ER IH R}.|p275
148 | wav22_silence_trimmed_wav/p276/p276_460_mic1.wav|{W IY} {W IH L} {P EY} {DH EH R} {B IH L Z}.|p276
149 | wav22_silence_trimmed_wav/p276/p276_461_mic1.wav|{DH IH S} {K AO R T} {HH AE Z} {M EY D} {AH N} {AO R D ER} {W IH CH} {HH AE Z} {N AA T} {B IH N} {AH B Z ER V D}.|p276
150 | wav22_silence_trimmed_wav/p276/p276_462_mic1.wav|{IH T} {W AA Z} {V EH R IY} {P AA Z AH T IH V} {AE N D} {V EH R IY} {HH EH L P F AH L}.|p276
151 | wav22_silence_trimmed_wav/p277/p277_458_mic1.wav|{S K AA T IH SH} {P AH B L IH K} {F IH N AE N S IH Z} {IH M ER JH} {F R AH M} {DH IH S} {R IY V Y UW} {EH N HH AE N S T}.|p277
152 | wav22_silence_trimmed_wav/p277/p277_459_mic1.wav|{SH IY} {AO L S OW} {D IH F EH N D AH D} {DH AH} {L AO R D} {CH AE N S AH L ER Z} {IH G Z IH S T IH NG} {P AW ER Z}.|p277
153 | wav22_silence_trimmed_wav/p277/p277_460_mic1.wav|{AY D} {L AH V} {T UW} {B IY} {L AY K} {P IY T ER}.|p277
154 | wav22_silence_trimmed_wav/p278/p278_405_mic1.wav|{W IY} {M AH S T} {IH M P R UW V} {AW ER} {R IY L EY SH AH N Z} {W IH DH} {G AH V ER N M AH N T}.|p278
155 | wav22_silence_trimmed_wav/p278/p278_406_mic1.wav|{DH AH} {K AE B AH N AH T} {IH Z} {S P L IH T} {OW V ER} {DH AH} {IH SH UW}.|p278
156 | wav22_silence_trimmed_wav/p278/p278_407_mic1.wav|{M EH S} {M AH N EY L} {W AA Z} {K IH L D} {AA N} {IH M P AE K T}.|p278
157 | wav22_silence_trimmed_wav/p279/p279_401_mic1.wav|{N OW B AA D IY} {IY V IH N} {N UW} {IH T} {HH AE D} {HH AE P AH N D}.|p279
158 | wav22_silence_trimmed_wav/p279/p279_402_mic1.wav|{W AH T} {W UH D} {B IY} {DH AH} {P OY N T}?|p279
159 | wav22_silence_trimmed_wav/p279/p279_403_mic1.wav|{DH AH} {W IY K L IY} {AE V ER IH JH} {W AA Z} {TH R IY} {AW ER Z}.|p279
160 | wav22_silence_trimmed_wav/p281/p281_455_mic1.wav|{SH IY} {HH AE Z} {N AW} {B IH N} {R EH JH IH S T ER D} {AE Z} {D IH S EY B AH L D}.|p281
161 | wav22_silence_trimmed_wav/p281/p281_456_mic1.wav|{W IY} {K UH D} {HH AA R D L IY} {B IH L IY V} {IH T}.|p281
162 | wav22_silence_trimmed_wav/p281/p281_457_mic1.wav|{DH AH} {M EH N} {AA R} {V EH R IY} {W ER IY D}.|p281
163 | wav22_silence_trimmed_wav/p282/p282_365_mic1.wav|{B AY} {DH AE T} {T AY M}, {HH AW EH V ER}, {IH T} {W AA Z} {AO L R EH D IY} {T UW} {L EY T}.|p282
164 | wav22_silence_trimmed_wav/p282/p282_366_mic1.wav|{IH T} {W AA Z} {AH} {F AH N IY} {G EY M}.|p282
165 | wav22_silence_trimmed_wav/p282/p282_367_mic1.wav|{DH EH R} {W AA Z AH N T} {AH} {G OW L}.|p282
166 | wav22_silence_trimmed_wav/p283/p283_466_mic1.wav|{DH AH} {EH D AH N B ER OW} {AA D IY AH N S} {W AA Z} {EY B AH L} {T UW} {AH N D ER S T AE N D} {DH AH} {D AY AH L AO G}.|p283
167 | wav22_silence_trimmed_wav/p283/p283_467_mic1.wav|{IH T} {AO L} {B IH G AE N} {AE Z} {AH N} {AE K S AH D AH N T}.|p283
168 | wav22_silence_trimmed_wav/p283/p283_468_mic1.wav|{W IY} {AA R} {G OW IH NG} {TH R UW} {DH AH} {P R AA S EH S}.|p283
169 | wav22_silence_trimmed_wav/p284/p284_420_mic1.wav|{HH IY} {W AA Z} {K AH N V IH K T AH D} {AE T} {G L AE S K OW} {SH EH R AH F} {K AO R T}.|p284
170 | wav22_silence_trimmed_wav/p284/p284_421_mic1.wav|{AH R IH K S AH N} {W UH D} {HH AE V} {AH P R UW V D}.|p284
171 | wav22_silence_trimmed_wav/p284/p284_422_mic1.wav|{AY V} {IH N V EH N T AH D} {AH} {V IH L AH JH} {IH N} {IY S T} {L AA TH IY AH N}.|p284
172 | wav22_silence_trimmed_wav/p285/p285_398_mic1.wav|{TH AE NG K F AH L IY}, {N OW}-{W AH N} {AA N} {DH AH} {B UH S} {IH Z} {T UW} {B AE D L IY} {HH ER T}.|p285
173 | wav22_silence_trimmed_wav/p285/p285_399_mic1.wav|{IH Z} {DH AE T} {P AA S AH B AH L}?|p285
174 | wav22_silence_trimmed_wav/p285/p285_400_mic1.wav|{B AH T}, {IH N} {F AE K T}, {DH AH} {R IH V ER S} {IH Z} {T R UW}.|p285
175 | wav22_silence_trimmed_wav/p286/p286_463_mic1.wav|{W IY} {AA R} {N AW} {AH P} {AH G EH N S T} {IH T}.|p286
176 | wav22_silence_trimmed_wav/p286/p286_464_mic1.wav|{F EY L Y ER} {T UW} {R IY AE K T} {T UW} {IY CH} {W AH N} {K UH D} {M IY N} {AH} {D IH Z AE S T ER}.|p286
177 | wav22_silence_trimmed_wav/p286/p286_465_mic1.wav|{IH T} {IH Z} {V AY T AH L} {T UW} {K L EH R AH F AY} {DH EH R} {R OW L}.|p286
178 | wav22_silence_trimmed_wav/p287/p287_419_mic1.wav|{IH T} {W AA Z} {IH M P AO R T AH N T} {T UW} {W IH N} {DH AH} {S IH NG G AH L Z}.|p287
179 | wav22_silence_trimmed_wav/p287/p287_420_mic1.wav|{AY M} {P L IY Z D} {AH B AW T} {W AH N} {TH IH NG}.|p287
180 | wav22_silence_trimmed_wav/p287/p287_421_mic1.wav|{DH AE T S} {B IH N} {AO L} {ER AW N D} {Y UH R AH P} {W IH DH} {M IY}.|p287
181 | wav22_silence_trimmed_wav/p288/p288_407_mic1.wav|{DH AH} {R IY L} {EH N AH M IY} {IH Z} {IH N} {Y AO R} {OW N} {B AE K} {Y AA R D}.|p288
182 | wav22_silence_trimmed_wav/p288/p288_408_mic1.wav|{DH AH} {V IH ZH W AH L} {AA R T S} {K AH M IH T IY} {T UH K} {DH IH S} {D IH S IH ZH AH N} {IH N} {D IH S EH M B ER}.|p288
183 | wav22_silence_trimmed_wav/p288/p288_409_mic1.wav|{W IY} {N IY D} {DH AH} {CH IY F} {M EH D AH K AH L} {AO F AH S ER} {T UW} {K L EH R AH F AY} {DH AH} {M AE T ER}.|p288
184 | wav22_silence_trimmed_wav/p292/p292_419_mic1.wav|{B AH T} {AH} {S ER P R AY Z} {IH Z} {IH N} {S T AO R}.|p292
185 | wav22_silence_trimmed_wav/p292/p292_420_mic1.wav|{Y UW} {K AE N} {K OW P} {W IH DH} {IH T}.|p292
186 | wav22_silence_trimmed_wav/p292/p292_421_mic1.wav|{HH IY} {W AA Z} {B AE K} {T UW} {S K W EH R} {W AH N}.|p292
187 | wav22_silence_trimmed_wav/p293/p293_395_mic1.wav|{DH IH S} {IH Z} {AH} {K AH M P L IY T L IY} {N UW} {IH K S P IH R IY AH N S} {F AO R} {M IY}.|p293
188 | wav22_silence_trimmed_wav/p293/p293_396_mic1.wav|{AY} {SH UH D} {N EH V ER} {HH AE V} {K AH M} {OW V ER} {T UW} {F AY F}.|p293
189 | wav22_silence_trimmed_wav/p293/p293_397_mic1.wav|{W AH T} {K AY N D} {AH V} {P ER S AH N} {IH Z} {HH IY}?|p293
190 | wav22_silence_trimmed_wav/p294/p294_419_mic1.wav|{W IY} {HH AE V} {EH V ER IY} {R IH S P EH K T} {F AO R} {M IH SH EH L}.|p294
191 | wav22_silence_trimmed_wav/p294/p294_420_mic1.wav|{M R} {B L AH NG K AH T} {W AA Z} {K AH N V IH N S T}.|p294
192 | wav22_silence_trimmed_wav/p294/p294_421_mic1.wav|{DH AH} {AA K SH AH N} {W IH L} {B IY} {HH EH L D} {T AH M AA R OW}.|p294
193 | wav22_silence_trimmed_wav/p295/p295_419_mic1.wav|{EH V R IY TH IH NG} {IH Z} {S ER AW N D AH D} {B AY} {K AH N F Y UW ZH AH N}.|p295
194 | wav22_silence_trimmed_wav/p295/p295_420_mic1.wav|{B AH T}, {AH V} {K AO R S}, {IH T} {IH Z AH N T}.|p295
195 | wav22_silence_trimmed_wav/p295/p295_421_mic1.wav|{W IY} {AA R} {AO L M OW S T} {AE T} {DH AE T} {P OY N T}, {B AH T} {N AA T} {K W AY T}.|p295
196 | wav22_silence_trimmed_wav/p297/p297_419_mic1.wav|{HH AA L IY W UH D} {K AA M AH D IY} {IH Z} {HH AE V IH NG} {AH} {G UH D} {W IY K}.|p297
197 | wav22_silence_trimmed_wav/p297/p297_420_mic1.wav|{Y EH T} {DH AH} {K AH N S EH N S AH S} {IH Z}, {IH T S} {W ER TH} {IH T}.|p297
198 | wav22_silence_trimmed_wav/p297/p297_421_mic1.wav|{IH T} {EH S} {DH AH} {T IH P} {AH V} {DH AH} {AY S B ER G}.|p297
199 | wav22_silence_trimmed_wav/p298/p298_400_mic1.wav|{DH AH} {S AW TH} {AE F R IH K AA N} {D IH F EH N D AH D} {HH IH Z} {W ER K} {EH TH AH K}.|p298
200 | wav22_silence_trimmed_wav/p298/p298_401_mic1.wav|{DH AH} {OW L IH M P IH K} {K AH M IH T IY} {SH UH D} {B IY} {AH SH EY M D} {AH V} {DH EH M S EH L V Z}.|p298
201 | wav22_silence_trimmed_wav/p298/p298_402_mic1.wav|{Y UW} {AA R} {G OW IH NG} {T UW} {G L AE S K OW} {EH R P AO R T}, {AE N D} {N AA T} {K AH M IH NG} {B AE K}.|p298
202 | wav22_silence_trimmed_wav/p299/p299_400_mic1.wav|{W IY} {AA R} {W ER K IH NG} {AA N} {DH AH} {T UW} {TH IH NG Z}.|p299
203 | wav22_silence_trimmed_wav/p299/p299_401_mic1.wav|{W IY} {OW N L IY} {L AO S T} {DH AH} {TH ER D} {G OW L}.|p299
204 | wav22_silence_trimmed_wav/p299/p299_402_mic1.wav|{W IH M AH N} {R IY P AO R T AH D} {M AO R} {D IH P R EH SH AH N} {DH AE N} {M EH N}.|p299
205 | wav22_silence_trimmed_wav/p300/p300_395_mic1.wav|{V IH K T AH R IY}, {SH IY} {IH N S IH S T AH D} {Y EH S T ER D EY}, {W AA Z} {F AA R} {M AO R} {IH M P AO R T AH N T} {DH AE N} {DH AH} {R IH W AO R D Z}.|p300
206 | wav22_silence_trimmed_wav/p300/p300_396_mic1.wav|{F AO R} {AO L} {HH IH Z} {S AH K S EH S AH Z}, {HH IY} {IH Z} {AH K AH S T AH M D} {T UW} {W EY T IH NG} {F AO R} {F UH L F IH L M AH N T}.|p300
207 | wav22_silence_trimmed_wav/p300/p300_397_mic1.wav|{AH} {F EY T AH L} {AE K S AH D AH N T} {IH N K W AY R IY} {IH Z} {AO NG G OW IH NG}.|p300
208 | wav22_silence_trimmed_wav/p301/p301_406_mic1.wav|{JH AE K S AH N} {M EY} {W EH L} {B IY} {R AY T}.|p301
209 | wav22_silence_trimmed_wav/p301/p301_407_mic1.wav|{M AY} {TH AO T S} {AA R} {W IH DH} {DH EH R} {F AE M AH L IY Z}.|p301
210 | wav22_silence_trimmed_wav/p301/p301_408_mic1.wav|{AY} {S AO} {S AH M} {G UH D} {TH IH NG Z}.|p301
211 | wav22_silence_trimmed_wav/p302/p302_311_mic1.wav|{IH T} {P L AE N Z} {T UW} {R IH T ER N} {T UW} {DH IH S} {F IY L D}.|p302
212 | wav22_silence_trimmed_wav/p302/p302_312_mic1.wav|{HH AW EH V ER}, {DH EH R} {W AA Z} {N OW} {HH OW P}, {AE N D} {G L AO R IY} {T UW}, {F AO R} {S K AA T L AH N D}.|p302
213 | wav22_silence_trimmed_wav/p302/p302_313_mic1.wav|{T UW} {AO L} {IH N T EH N T S} {AE N D} {P ER P AH S AH Z}, {HH IY} {R AE N} {DH AH} {SH OW}.|p302
214 | wav22_silence_trimmed_wav/p303/p303_348_mic1.wav|{AY} {K AE N} {AH P IH R} {T UW} {B IY} {N AY S} {AE N D} {L AH V L IY}.|p303
215 | wav22_silence_trimmed_wav/p303/p303_349_mic1.wav|{AY} {D OW N T} {TH IH NG K} {DH AH} {S AO D AH S} {W IH L} {L EY} {D AW N}.|p303
216 | wav22_silence_trimmed_wav/p303/p303_350_mic1.wav|{P IY P AH L} {T EH N D AH D} {T UW} {S T EY} {DH EH R} {F AO R} {S AH M} {T AY M}.|p303
217 | wav22_silence_trimmed_wav/p304/p304_419_mic1.wav|{AY} {K AE N} {AH N D ER S T AE N D} {W AY} {DH EY} {HH AE V} {G AO N}.|p304
218 | wav22_silence_trimmed_wav/p304/p304_420_mic1.wav|{DH AE T}, {DH OW}, {W AA Z} {DH EH N}.|p304
219 | wav22_silence_trimmed_wav/p304/p304_421_mic1.wav|{P IY P AH L} {SH UH D} {N AA T} {B IH K AH M} {K AH M P L EY S AH N T}.|p304
220 | wav22_silence_trimmed_wav/p305/p305_417_mic1.wav|{W IY V} {S T IH L} {G AA T} {AH} {S EY} {IH N} {DH IH S}.|p305
221 | wav22_silence_trimmed_wav/p305/p305_418_mic1.wav|{SH IY} {K AE N T} {S EY} {W EH R} {SH IY} {W AA Z}, {W AH T} {SH IY} {D IH D}.|p305
222 | wav22_silence_trimmed_wav/p305/p305_419_mic1.wav|{AY L} {T EY K} {W AH T} {K AH M Z} {AH L AO NG}.|p305
223 | wav22_silence_trimmed_wav/p306/p306_355_mic1.wav|{AY} {D IH D AH N T} {IY V IH N} {HH AE V} {DH AH} {F ER S T} {AY D IY AH}.|p306
224 | wav22_silence_trimmed_wav/p306/p306_356_mic1.wav|{N OW} {P ER S AH N} {W AA Z} {CH AA R JH D}.|p306
225 | wav22_silence_trimmed_wav/p306/p306_357_mic1.wav|{AY V} {OW N L IY} {M EH T} {HH IH M} {TH R IY} {T AY M Z}.|p306
226 | wav22_silence_trimmed_wav/p307/p307_419_mic1.wav|{IH T S} {AH} {P AA L AH S IY} {W IH CH} {HH AE Z} {W ER K T} {F AO R} {AH S}.|p307
227 | wav22_silence_trimmed_wav/p307/p307_420_mic1.wav|{L OW K AH L IY}, {T UW}, {DH AH} {M}{EH M} {P IY} {IH Z} {AH N D ER} {F AY ER}.|p307
228 | wav22_silence_trimmed_wav/p307/p307_421_mic1.wav|{HH IH Z} {V Y UW Z} {AA R} {HH AA R D L IY} {S ER P R AY Z IH NG}.|p307
229 | wav22_silence_trimmed_wav/p308/p308_418_mic1.wav|{AO R} {S OW} {SH IY} {TH AO T}.|p308
230 | wav22_silence_trimmed_wav/p308/p308_419_mic1.wav|{IH T} {W AA Z} {AH N AH DH ER} {G UH D} {AY D IY AH}.|p308
231 | wav22_silence_trimmed_wav/p308/p308_420_mic1.wav|{DH IH S} {W UH D} {D IH S K ER IH JH} {IH N V EH S T M AH N T} {AE N D} {JH AA B} {K R IY EY SH AH N}.|p308
232 | wav22_silence_trimmed_wav/p310/p310_419_mic1.wav|{IH T} {W UH D} {SH OW} {DH EH M} {AH} {P AA Z AH T IH V} {W EY} {F AO R W ER D}.|p310
233 | wav22_silence_trimmed_wav/p310/p310_420_mic1.wav|{AY} {TH IH NG K} {IH T S} {AH} {D IH S G R EY S}.|p310
234 | wav22_silence_trimmed_wav/p310/p310_421_mic1.wav|{S OW} {HH IY} {SH UH D} {B IY}.|p310
235 | wav22_silence_trimmed_wav/p311/p311_418_mic1.wav|{AY} {AH G R IY} {W IH DH} {DH EH M}.|p311
236 | wav22_silence_trimmed_wav/p311/p311_419_mic1.wav|{AY} {F IY L} {V EH R IY} {S T R AO NG L IY} {AH B AW T} {DH AE T}.|p311
237 | wav22_silence_trimmed_wav/p311/p311_420_mic1.wav|{W IY} {W IH L} {L IH S AH N} {T UW} {K AA L IY G Z}.|p311
238 | wav22_silence_trimmed_wav/p312/p312_414_mic1.wav|{DH AH} {P R AH P OW Z AH L Z} {HH AE V} {B IH N} {T EY K AH N} {AW T} {AH V} {K AA N T EH K S T}.|p312
239 | wav22_silence_trimmed_wav/p312/p312_415_mic1.wav|{G AA L F} {HH AE Z} {L AO S T} {AH} {G R EY T} {K EH R IH K T ER}.|p312
240 | wav22_silence_trimmed_wav/p312/p312_416_mic1.wav|{HH AW EH V ER}, {IH T} {HH AE Z} {HH AE D} {OW N L IY} {AH} {L IH M AH T AH D} {AH P T EY K}.|p312
241 | wav22_silence_trimmed_wav/p313/p313_418_mic1.wav|{W UH D} {DH EY} {W ER K} {T AH G EH DH ER} {AH G EH N}?|p313
242 | wav22_silence_trimmed_wav/p313/p313_419_mic1.wav|{AY} {D IH D AH N T} {P L EY} {AO L} {DH AE T} {W EH L}.|p313
243 | wav22_silence_trimmed_wav/p313/p313_420_mic1.wav|{HH UW} {K AE N} {T EH L}?|p313
244 | wav22_silence_trimmed_wav/p314/p314_418_mic1.wav|{AE Z} {F AO R} {OW AH N}, {K L AE S} {IH Z} {P ER M AA N EH N T}.|p314
245 | wav22_silence_trimmed_wav/p314/p314_419_mic1.wav|{T IY} {EY CH} {IY} {P EY N} {W AA Z} {AO L M OW S T} {T UW} {M AH CH} {T UW} {B EH R}.|p314
246 | wav22_silence_trimmed_wav/p314/p314_420_mic1.wav|{M AY} {TH AO T S} {AA R} {W IH DH} {DH EH R} {F AE M AH L IY Z}.|p314
247 | wav22_silence_trimmed_wav/p316/p316_418_mic1.wav|{DH IH S} {IH Z} {DH AH} {G R EY T AH S T} {M OW M AH N T} {AH V} {AW ER} {L AY V Z}.|p316
248 | wav22_silence_trimmed_wav/p316/p316_419_mic1.wav|{AA D IY AH N S AH Z} {DH EH R} {W ER} {SH AA K T}.|p316
249 | wav22_silence_trimmed_wav/p316/p316_420_mic1.wav|{DH EY} {R IH S P AA N D IH D} {IH N} {DH AH} {M OW S T} {P AA Z AH T IH V} {W EY}.|p316
250 | wav22_silence_trimmed_wav/p317/p317_418_mic1.wav|{SH IY} {W AA Z} {IH K S T R IY M L IY} {R IH JH AH D}.|p317
251 | wav22_silence_trimmed_wav/p317/p317_419_mic1.wav|{N IY DH ER} {S AY D} {K AE N} {W IH N} {DH IH S} {W AO R}.|p317
252 | wav22_silence_trimmed_wav/p317/p317_420_mic1.wav|{B AA R OW IH NG} {P AW ER} {W AA Z} {W AH N} {AH V} {DH OW Z} {IH M P R UW V M AH N T S}.|p317
253 | wav22_silence_trimmed_wav/p318/p318_419_mic1.wav|{AY} {D IH D AH N T} {P L EY} {AO L} {DH AE T} {W EH L}.|p318
254 | wav22_silence_trimmed_wav/p318/p318_420_mic1.wav|{HH UW} {K AE N} {T EH L}?|p318
255 | wav22_silence_trimmed_wav/p318/p318_421_mic1.wav|{IH T} {AH P IH R Z} {T UW} {HH AE V} {B IH N} {K L OW Z D} {D AW N}.|p318
256 | wav22_silence_trimmed_wav/p323/p323_418_mic1.wav|{W IY} {T EH N D} {T UW} {K AH M} {G UH D} {AE T} {DH AH} {EH N D}.|p323
257 | wav22_silence_trimmed_wav/p323/p323_419_mic1.wav|{AY} {HH AE V AH N T} {EH N JH OY D} {DH AH} {L AE S T} {K AH P AH L} {AH V} {Y IH R Z}.|p323
258 | wav22_silence_trimmed_wav/p323/p323_420_mic1.wav|{DH AH} {K AH M IH SH AH N} {IH Z} {N AA T} {DH AH} {OW N L IY} {L UW Z ER}.|p323
259 | wav22_silence_trimmed_wav/p326/p326_395_mic1.wav|{IH T} {W AA Z} {L AY K} {DH AH} {OW L D} {D EY Z}.|p326
260 | wav22_silence_trimmed_wav/p326/p326_396_mic1.wav|{AH} {G R EY T} {D IY L} {HH AE Z} {B IH N} {AH CH IY V D}.|p326
261 | wav22_silence_trimmed_wav/p326/p326_397_mic1.wav|{M AY} {F Y UW CH ER} {IH Z} {IH N} {DH AH} {M EH R AH TH AA N}.|p326
262 | wav22_silence_trimmed_wav/p329/p329_419_mic1.wav|{W IY V} {K AH M} {F R AH M} {AH} {L AO NG} {W EY} {B AE K}.|p329
263 | wav22_silence_trimmed_wav/p329/p329_420_mic1.wav|{AY V} {B IH N} {AE S K T} {DH AE T} {AO L R EH D IY}, {DH IH S} {M AO R N IH NG} - {AA N} {DH AH} {R EY D IY OW}.|p329
264 | wav22_silence_trimmed_wav/p329/p329_421_mic1.wav|{AY L} {N EH V ER} {TH IH NG K} {AH V} {M AY S EH L F} {AE Z} {AH} {S T AA R}.|p329
265 | wav22_silence_trimmed_wav/p330/p330_418_mic1.wav|{N AW}, {HH AW EH V ER}, {IH T} {HH AE Z} {AH N D ER G AO N} {AH} {D R AH M AE T IH K} {D IH K L AY N}.|p330
266 | wav22_silence_trimmed_wav/p330/p330_419_mic1.wav|{DH IH S} {IH Z} {AH B AW T} {DH AH} {AO R AH N JH} {AO R D ER}.|p330
267 | wav22_silence_trimmed_wav/p330/p330_420_mic1.wav|{B AH T} {IH T} {L UH K S} {G UH D} {F AO R} {N EH K S T} {Y IH R}.|p330
268 | wav22_silence_trimmed_wav/p333/p333_419_mic1.wav|{DH EY} {W ER} {IH N} {G UH D} {HH Y UW M ER}, {T UW}.|p333
269 | wav22_silence_trimmed_wav/p333/p333_420_mic1.wav|{IH T} {Y UW Z D} {T UW} {B AA DH ER} {M IY} {S AH M T AY M Z}, {B AH T} {IH T} {D AH Z AH N T} {EH N IY} {M AO R}.|p333
270 | wav22_silence_trimmed_wav/p333/p333_421_mic1.wav|{TH R IY} {M AH N TH S} {AO R} {S OW} {L EY T ER}, {DH EY} {AA R} {R IH T ER N IH NG}.|p333
271 | wav22_silence_trimmed_wav/p334/p334_419_mic1.wav|{DH IH S} {IH Z} {AH} {TH ER OW L IY} {HH AE N S AH M} {AE N D} {EH N JH OY AH B AH L} {P R AH D AH K SH AH N}.|p334
272 | wav22_silence_trimmed_wav/p334/p334_420_mic1.wav|{N OW} {EH V AH D AH N S} {DH AE T} {IH T} {W AA Z} {OW S AH M AH} {B IH N} {L EY D AH N}.|p334
273 | wav22_silence_trimmed_wav/p334/p334_421_mic1.wav|{AY} {TH IH NG K} {IH T} {IH Z} {T OW T AH L IY} {R AO NG}.|p334
274 | wav22_silence_trimmed_wav/p335/p335_418_mic1.wav|{L AO R D} {S EY N S B EH R IY} {IH Z} {N AA T} {AH} {N UW K AH M ER}.|p335
275 | wav22_silence_trimmed_wav/p335/p335_419_mic1.wav|{HH IY} {IH Z} {N AA T} {V EH R IY} {B IH G}, {IY DH ER}.|p335
276 | wav22_silence_trimmed_wav/p335/p335_420_mic1.wav|{S T IH L}, {DH AH} {IH K S P IH R IY AH N S} {W AA Z} {AH M EY Z IH NG}, {SH IY} {S EH Z}.|p335
277 | wav22_silence_trimmed_wav/p336/p336_419_mic1.wav|{W UH D} {HH IY} {EH V ER} {G OW} {B AE K} {T UW} {DH AH} {B IH G IH N IH NG}?|p336
278 | wav22_silence_trimmed_wav/p336/p336_420_mic1.wav|{W AH T} {IH Z} {G OW IH NG} {AA N}?|p336
279 | wav22_silence_trimmed_wav/p336/p336_421_mic1.wav|{L EH T} {DH AE T} {B IY} {DH EH R} {M AH M AO R IY AH L}.|p336
280 | wav22_silence_trimmed_wav/p339/p339_419_mic1.wav|{W IY} {SH UH D} {HH AE V} {B IH N} {R IH W AO R D IH D} {F AO R} {P R AH D UW S IH NG}.|p339
281 | wav22_silence_trimmed_wav/p339/p339_420_mic1.wav|{AY} {AE M} {AO L S OW} {D IH L AY T AH D} {F AO R} {AO L} {DH AH} {P L EY ER Z}.|p339
282 | wav22_silence_trimmed_wav/p339/p339_421_mic1.wav|{HH IY} {D IH D} {V EH R IY} {W EH L}, {B AH T} {D IH D AH N T} {W IH N}.|p339
283 | wav22_silence_trimmed_wav/p340/p340_419_mic1.wav|{AY L} {N EH V ER} {TH IH NG K} {AH V} {M AY S EH L F} {AE Z} {AH} {S T AA R}.|p340
284 | wav22_silence_trimmed_wav/p340/p340_420_mic1.wav|{IH T} {W AA Z} {W AH N} {N UW} {Y IH R}.|p340
285 | wav22_silence_trimmed_wav/p340/p340_421_mic1.wav|{W IY} {M EY D} {AH} {W ER L D} {R EH K ER D} {T AH G EH DH ER}.|p340
286 | wav22_silence_trimmed_wav/p341/p341_405_mic1.wav|{Y UW} {HH AE V} {T UW} {HH AE V} {S AH M} {HH OW P} {AE N D} {F EY TH}.|p341
287 | wav22_silence_trimmed_wav/p341/p341_406_mic1.wav|{DH EY} {HH OW L D} {AA N} {F AO R} {M EH N IY} {Y IH R Z}.|p341
288 | wav22_silence_trimmed_wav/p341/p341_407_mic1.wav|{HH AW EH V ER}, {DH AH} {R IY P AO R T} {W AA Z} {IH N K ER EH K T}.|p341
289 | wav22_silence_trimmed_wav/p343/p343_395_mic1.wav|{IY CH} {W AH N} {IH Z} {AE Z} {G UH D} {AE Z} {DH AH} {AH DH ER}.|p343
290 | wav22_silence_trimmed_wav/p343/p343_396_mic1.wav|{S AH B} {N AA T} {Y UW Z D}, {M AA R TH AH R}.|p343
291 | wav22_silence_trimmed_wav/p343/p343_397_mic1.wav|{AY} {HH AE V} {HH AA R D L IY} {EH N IY} {K AH M IH T M AH N T S}.|p343
292 | wav22_silence_trimmed_wav/p345/p345_395_mic1.wav|{AY} {W OW N T} {G OW} {B AE K} {T UW} {HH AA L AH N D}.|p345
293 | wav22_silence_trimmed_wav/p345/p345_396_mic1.wav|{DH EH R} {IH Z} {S T IH L} {AH} {L AO NG} {W EY} {T UW} {G OW}.|p345
294 | wav22_silence_trimmed_wav/p345/p345_397_mic1.wav|{AO L M OW S T} {OW V ER N AY T}, {HH IY} {F AW N D} {HH IH Z} {V OY S}.|p345
295 | wav22_silence_trimmed_wav/p347/p347_419_mic1.wav|{DH EY} {W AO N T IH D} {T UW} {SH OW} {W AH T} {DH EY} {K UH D} {D UW}.|p347
296 | wav22_silence_trimmed_wav/p347/p347_420_mic1.wav|{AY} {JH AH S T} {W AA N T} {T UW} {G EH T} {R IH D} {AH V} {IH T}.|p347
297 | wav22_silence_trimmed_wav/p347/p347_421_mic1.wav|{AH F IH SH AH L Z} {S EY} {DH AH} {S IH T IY} {M AH S T} {AH CH IY V} {DH IH S}.|p347
298 | wav22_silence_trimmed_wav/p351/p351_419_mic1.wav|{W AH T S} {IH T} {F AO R}, {AE F T ER} {AO L}?|p351
299 | wav22_silence_trimmed_wav/p351/p351_420_mic1.wav|{W IY} {W IH L} {B IY} {P L IY Z D} {T UW} {T AO K} {T UW} {DH EH M}.|p351
300 | wav22_silence_trimmed_wav/p351/p351_421_mic1.wav|{AH} {N AH M B ER} {AH V} {P IY P AH L} {K UH D} {N AA T} {S P IY K} {IH NG G L IH SH}.|p351
301 | wav22_silence_trimmed_wav/p360/p360_419_mic1.wav|{AY} {D OW N T} {TH IH NG K} {DH EY} {AA R} {T UW} {S IH R IY AH S}.|p360
302 | wav22_silence_trimmed_wav/p360/p360_420_mic1.wav|{T AY G ER} {IH Z} {DH AH} {IH K S EH P SH AH N}.|p360
303 | wav22_silence_trimmed_wav/p360/p360_421_mic1.wav|{DH EH R} {W AA Z} {N OW} {HH AE P IY} {EH N D IH NG}.|p360
304 | wav22_silence_trimmed_wav/p361/p361_419_mic1.wav|{EH V R IY W AH N} {IH N} {B R IH T AH N} {IH Z} {P R AW D} {AH V} {DH IH S} {T IY M}.|p361
305 | wav22_silence_trimmed_wav/p361/p361_420_mic1.wav|{W IY} {S ER T AH N L IY} {D UW} {N AA T} {IH N T EH N D} {T UW} {D IH S B AE N D}.|p361
306 | wav22_silence_trimmed_wav/p361/p361_421_mic1.wav|{M AY} {L AY F} {IH Z} {G OW IH NG} {T UW} {G OW} {AA N}.|p361
307 | wav22_silence_trimmed_wav/p362/p362_397_mic1.wav|{L OW K EY T} {IH N} {S K AA T L AH N D} {AO L S OW} {R AH F Y UW Z D} {T UW} {K AA M EH N T}.|p362
308 | wav22_silence_trimmed_wav/p362/p362_399_mic1.wav|{DH AE T} {IH Z} {W AY} {HH IY} {IH Z} {S OW} {S P EH SH AH L}.|p362
309 | wav22_silence_trimmed_wav/p362/p362_406_mic1.wav|{HH IY} {AE D IH D} {DH AE T} {HH IY} {F EH L T} {AH N D ER M AY N D}.|p362
310 | wav22_silence_trimmed_wav/p363/p363_418_mic1.wav|{OW V ER AO L}, {IH T} {HH AE Z} {S ER T AH N L IY} {W ER K T} {AW T}.|p363
311 | wav22_silence_trimmed_wav/p363/p363_419_mic1.wav|{DH IH S} {IH Z} {AE K CH UW AH L IY} {T R UW}.|p363
312 | wav22_silence_trimmed_wav/p363/p363_420_mic1.wav|{B OW TH} {W ER} {R IH D UW S T} {T UW} {R AH B AH L}.|p363
313 | wav22_silence_trimmed_wav/p364/p364_303_mic1.wav|{IH T} {W AA Z} {DH EH N} {DH AE T} {R EY N JH ER Z} {S K AO R D}.|p364
314 | wav22_silence_trimmed_wav/p364/p364_304_mic1.wav|{DH IH S} {W AA Z} {R IH P IY T IH D} {DH AH} {S EH K AH N D} {T AY M}.|p364
315 | wav22_silence_trimmed_wav/p364/p364_305_mic1.wav|{AY} {CH OW Z} {DH AH} {F AO R M ER}.|p364
316 | wav22_silence_trimmed_wav/p374/p374_419_mic1.wav|{DH EY} {HH AE V} {N AA T} {B IH N} {N EY M D}.|p374
317 | wav22_silence_trimmed_wav/p374/p374_420_mic1.wav|{IH T} {F IY L Z} {L AY K} {T UW} {M AH CH}.|p374
318 | wav22_silence_trimmed_wav/p374/p374_421_mic1.wav|{DH IH S} {IH Z} {DH AH} {L AE S T} {TH IH NG} {W IY} {EH V ER} {IH K S P EH K T AH D}.|p374
319 | wav22_silence_trimmed_wav/p376/p376_419_mic1.wav|{AY V} {N EH V ER} {D AH N} {DH AE T} {B IH F AO R} {IH N} {M AY} {L AY F}.|p376
320 | wav22_silence_trimmed_wav/p376/p376_420_mic1.wav|{AY} {K AE N T} {S T AA P} {P IY P AH L} {TH IH NG K IH NG}.|p376
321 | wav22_silence_trimmed_wav/p376/p376_421_mic1.wav|{DH AE T} {W IH L} {B IY} {B EH S T} {F AO R} {HH IH M}.|p376
322 | wav22_silence_trimmed_wav/s5/s5_395_mic1.wav|{DH AE T} {HH AE D} {HH AE P AH N D} {L AO NG} {B IH F AO R} {HH AE F}-{T AY M} {Y EH S T ER D EY}.|s5
323 | wav22_silence_trimmed_wav/s5/s5_396_mic1.wav|{M EH N IY} {F AA R M ER Z} {K AE N AA T} {IY V IH N} {AH G R IY} {W IH DH IH N} {DH EH R} {OW N} {F AE M AH L IY Z}.|s5
324 | wav22_silence_trimmed_wav/s5/s5_397_mic1.wav|{N UW} {Y AO R K} {P R AH V AY D AH D} {L IH T AH L} {HH EH L P} {F AO R} {DH AH} {L AH N D AH N} {M AA R K AH T}.|s5
325 | 


--------------------------------------------------------------------------------
/inference.py:
--------------------------------------------------------------------------------
  1 | import warnings
  2 | warnings.filterwarnings(action='ignore')
  3 | 
  4 | import os
  5 | import utils
  6 | import argparse
  7 | import torch
  8 | from models import SynthesizerTrn
  9 | from text.symbols import symbol_len
 10 | from data_utils import infer_text_process
 11 | 
 12 | import soundcard as sc
 13 | import soundfile as sf
 14 | 
 15 | import time
 16 | 
 17 | def inference(args):
 18 | 
 19 |     hps     = utils.get_hparams(args)
 20 |     device  = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
 21 |     torch.manual_seed(hps.train.seed)
 22 | 
 23 |     net_g   = SynthesizerTrn(symbol_len(hps.data.languages),
 24 |                              hps.data.filter_length // 2 + 1,
 25 |                              hps.train.segment_size // hps.data.hop_length,
 26 |                              n_speakers=len(hps.data.speakers),
 27 |                              midi_start=hps.data.midi_start,
 28 |                              midi_end=hps.data.midi_end,
 29 |                              octave_range=hps.data.octave_range,
 30 |                              **hps.model,
 31 |                              ### add ###
 32 |                              sr=hps.data.sampling_rate,
 33 |                              W=hps.data.ying_window,
 34 |                              w_step=hps.data.ying_hop,
 35 |                              tau_max=hps.data.tau_max
 36 |                              ###########
 37 |                              ).to(device)
 38 |     
 39 |     # load checkpoint
 40 |     utils.load_checkpoint(args.model_path, model_g=net_g)
 41 | 
 42 |     # play audio by system default
 43 |     speaker = sc.get_speaker(sc.default_speaker().name)
 44 |     s_name_list = hps.data.speakers
 45 | 
 46 |     # parameter settings
 47 |     noise_scale     = torch.tensor(0.66)    # adjust z_p noise
 48 |     noise_scale_w   = torch.tensor(0.66)    # adjust SDP noise
 49 |     length_scale    = torch.tensor(1.0)     # adjust sound length scale (talk speed)
 50 |     max_len         = None                  # frame max length
 51 | 
 52 |     if args.is_save is True:
 53 |         n_save = 0
 54 |         save_dir = os.path.join("./infer_logs/", args.model)
 55 |         os.makedirs(save_dir, exist_ok=True)
 56 | 
 57 |     while True:
 58 |         
 59 |         print("\n")
 60 | 
 61 |         # get speaker id
 62 |         speaker_name = input("Enter speaker name. ==> ")
 63 |         if speaker_name=="":
 64 |             print("Empty input is detected... Exit...")
 65 |             break
 66 |         try:
 67 |             sid = int(s_name_list.index(speaker_name))
 68 |         except:
 69 |             while True:
 70 |                 print(f"{speaker_name} does not existed. The list of speakers is displayed below.")
 71 |                 print(s_name_list)
 72 |                 speaker_name = input("Enter speaker name. ==> ")
 73 |                 try:
 74 |                     sid = int(s_name_list.index(speaker_name))
 75 |                     break
 76 |                 except:
 77 |                     print("~RETRY~")
 78 |                     pass
 79 |         sid = torch.unsqueeze(torch.tensor(sid, dtype=torch.int64), dim=0)
 80 | 
 81 |         # get text
 82 |         text = input("Enter text. ==> ")
 83 |         if text=="":
 84 |             print("Empty input is detected... Exit...")
 85 |             break
 86 |         
 87 |         # get shift pitch
 88 |         scope = input("Enter pitch shift value(int). -15〜15 ==> ")
 89 |         if scope=="":
 90 |             print("Empty input is detected... Exit...")
 91 |             break
 92 |         try:
 93 |             scope = int(scope)
 94 |         except:
 95 |             while True:
 96 |                 print(f"Enter an integer value. {scope} is not integer value.")
 97 |                 scope = input("Enter pitch shift value(int). -15〜15 ==> ")
 98 |                 try:
 99 |                     scope = int(scope)
100 |                     break
101 |                 except:
102 |                     print("~RETRY~")
103 |                     pass
104 |         if scope > 15:
105 |             scope = 15
106 |             print("The upper limit of pitch shift value is 15.  15 is entered.")
107 |         elif scope < -15:
108 |             scope = -15
109 |             print("The lower limit of pitch shift value is -15.  -15 is entered.")
110 |         scope_shift = torch.tensor(scope, dtype=torch.int64)      # pitch adjust
111 |         
112 |         # measure the execution time 
113 |         torch.cuda.synchronize()
114 |         start = time.time()
115 | 
116 |         # required_grad is False
117 |         with torch.inference_mode():
118 |             x, t, x_lengths = infer_text_process(text=text, hps=hps)
119 | 
120 |             # generate audio
121 |             y_hat, _, _, _ = net_g.infer(x=x.to(device), 
122 |                                          t=t.to(device), 
123 |                                          x_lengths=x_lengths.to(device), 
124 |                                          sid=sid.to(device), 
125 |                                          noise_scale=noise_scale.to(device), 
126 |                                          noise_scale_w=noise_scale_w.to(device), 
127 |                                          length_scale=length_scale.to(device), 
128 |                                          max_len=max_len, 
129 |                                          scope_shift=scope_shift.to(device))
130 | 
131 |         # measure the execution time 
132 |         torch.cuda.synchronize()
133 |         elapsed_time = time.time() - start
134 |         print(f"Gen Time : {elapsed_time}")
135 |         
136 |         # play audio
137 |         speaker.play(y_hat.view(-1).to('cpu').detach().numpy().copy(), hps.data.sampling_rate)
138 |         
139 |         # save audio
140 |         if args.is_save is True:
141 |             n_save += 1
142 |             save_path = os.path.join(save_dir, str(n_save).zfill(3)+f"_{speaker_name}_scope={scope}_{text}.wav")
143 |             data = y_hat.view(-1).to('cpu').detach().numpy().copy()
144 |             sf.write(
145 |                      file=save_path,
146 |                      data=data,
147 |                      samplerate=hps.data.sampling_rate,
148 |                      format="WAV")
149 |             print(f"Audio is saved at : {save_path}")
150 | 
151 | 
152 |     return 0
153 | 
154 | if __name__ == "__main__":
155 | 
156 |     parser = argparse.ArgumentParser()
157 |     parser.add_argument('--config',
158 |                         type=str,
159 |                         required=True,     
160 |                         help='Path to configuration file')
161 |     parser.add_argument('--model',
162 |                         type=str,
163 |                         required=True,           
164 |                         help='Model name')
165 |     parser.add_argument('--model_path',
166 |                         type=str,
167 |                         required=True,
168 |                         help='Path to checkpoint')
169 |     parser.add_argument('--is_save',
170 |                         type=str,
171 |                         default=True,
172 |                         help='Whether to save output or not')
173 |     args = parser.parse_args()
174 |     
175 |     inference(args)


--------------------------------------------------------------------------------
/losses.py:
--------------------------------------------------------------------------------
 1 | # from https://github.com/jaywalnut310/vits
 2 | import torch
 3 | from torch.autograd import Function
 4 | 
 5 | 
 6 | def feature_loss(fmap_r, fmap_g):
 7 |     loss = 0
 8 |     for dr, dg in zip(fmap_r, fmap_g):
 9 |         for rl, gl in zip(dr, dg):
10 |             rl = rl.float().detach()
11 |             gl = gl.float()
12 |             loss += torch.mean(torch.abs(rl - gl))
13 | 
14 |     return loss * 2
15 | 
16 | 
17 | def discriminator_loss(disc_real_outputs, disc_generated_outputs):
18 |     loss = 0
19 |     r_losses = []
20 |     g_losses = []
21 |     for dr, dg in zip(disc_real_outputs, disc_generated_outputs):
22 |         dr = dr.float()
23 |         dg = dg.float()
24 |         r_loss = torch.mean((1-dr)**2)
25 |         g_loss = torch.mean(dg**2)
26 |         loss += (r_loss + g_loss)
27 |         r_losses.append(r_loss.item())
28 |         g_losses.append(g_loss.item())
29 | 
30 |     return loss, r_losses, g_losses
31 | 
32 | 
33 | def generator_loss(disc_outputs):
34 |     loss = 0
35 |     gen_losses = []
36 |     for dg in disc_outputs:
37 |         dg = dg.float()
38 |         l = torch.mean((1-dg)**2)
39 |         gen_losses.append(l)
40 |         loss += l
41 | 
42 |     return loss, gen_losses
43 | 
44 | 
45 | def kl_loss(z_p, logs, m_p, logs_p, z_mask):
46 |     """
47 |     z_p, logs: [b, h, t_t]
48 |     m_p, logs_p: [b, h, t_t]
49 |     """
50 |     z_p = z_p.float()
51 |     logs = logs.float()
52 |     m_p = m_p.float()
53 |     logs_p = logs_p.float()
54 |     z_mask = z_mask.float()
55 | 
56 |     kl = logs_p - logs - 0.5
57 |     kl += 0.5 * ((z_p - m_p)**2) * torch.exp(-2. * logs_p)
58 |     kl = torch.sum(kl * z_mask)
59 |     l = kl / torch.sum(z_mask)
60 |     return l
61 | 
62 | 
63 | class ReverseLayerF(Function):
64 | 
65 |     @staticmethod
66 |     def forward(ctx, x, alpha):
67 |         ctx.alpha = alpha
68 | 
69 |         return x.view_as(x)
70 | 
71 |     @staticmethod
72 |     def backward(ctx, grad_output):
73 |         output = grad_output.neg() * ctx.alpha
74 | 
75 |         return output, None
76 | 


--------------------------------------------------------------------------------
/mel_processing.py:
--------------------------------------------------------------------------------
  1 | # from https://github.com/jaywalnut310/vits
  2 | import torch
  3 | import torch.utils.data
  4 | from librosa.filters import mel as librosa_mel_fn
  5 | from torch.cuda.amp import autocast
  6 | 
  7 | def dynamic_range_compression_torch(x, C=1, clip_val=1e-5):
  8 |     """
  9 |     PARAMS
 10 |     ------
 11 |     C: compression factor
 12 |     """
 13 |     return torch.log(torch.clamp(x, min=clip_val) * C)
 14 | 
 15 | 
 16 | def dynamic_range_decompression_torch(x, C=1):
 17 |     """
 18 |     PARAMS
 19 |     ------
 20 |     C: compression factor used to compress
 21 |     """
 22 |     return torch.exp(x) / C
 23 | 
 24 | 
 25 | def spectral_normalize_torch(magnitudes):
 26 |     output = dynamic_range_compression_torch(magnitudes)
 27 |     return output
 28 | 
 29 | 
 30 | def spectral_de_normalize_torch(magnitudes):
 31 |     output = dynamic_range_decompression_torch(magnitudes)
 32 |     return output
 33 | 
 34 | 
 35 | mel_basis = {}
 36 | hann_window = {}
 37 | 
 38 | 
 39 | def spectrogram_torch(y, n_fft, sampling_rate, hop_size, win_size, center=False):
 40 |     if torch.min(y) < -1.:
 41 |         print('min value is ', torch.min(y))
 42 |     if torch.max(y) > 1.:
 43 |         print('max value is ', torch.max(y))
 44 | 
 45 |     global hann_window
 46 |     dtype_device = str(y.dtype) + '_' + str(y.device)
 47 |     wnsize_dtype_device = str(win_size) + '_' + dtype_device
 48 |     if wnsize_dtype_device not in hann_window:
 49 |         hann_window[wnsize_dtype_device] = torch.hann_window(win_size).to(dtype=y.dtype, device=y.device)
 50 | 
 51 |     y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect')
 52 |     y = y.squeeze(1)
 53 |     with autocast(enabled=False):
 54 |         y=y.float()
 55 |         spec = torch.stft(
 56 |             y, 
 57 |             n_fft, 
 58 |             hop_length=hop_size, 
 59 |             win_length=win_size, 
 60 |             window=hann_window[wnsize_dtype_device],
 61 |             center=center, 
 62 |             pad_mode='reflect', 
 63 |             normalized=False, 
 64 |             onesided=True,
 65 |             return_complex=False
 66 |         )
 67 | 
 68 |     spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6)
 69 |     return spec
 70 | 
 71 | 
 72 | def spec_to_mel_torch(spec, n_fft, num_mels, sampling_rate, fmin, fmax):
 73 |     global mel_basis
 74 |     dtype_device = str(spec.dtype) + '_' + str(spec.device)
 75 |     fmax_dtype_device = str(fmax) + '_' + dtype_device
 76 |     if fmax_dtype_device not in mel_basis:
 77 |         mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax)
 78 |         mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to(dtype=spec.dtype, device=spec.device)
 79 |     spec = torch.matmul(mel_basis[fmax_dtype_device], spec)
 80 |     spec = spectral_normalize_torch(spec)
 81 |     return spec
 82 | 
 83 | 
 84 | def mel_spectrogram_torch(y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False):
 85 |     if torch.min(y) < -1.:
 86 |         print('min value is ', torch.min(y))
 87 |     if torch.max(y) > 1.:
 88 |         print('max value is ', torch.max(y))
 89 | 
 90 |     global mel_basis, hann_window
 91 |     dtype_device = str(y.dtype) + '_' + str(y.device)
 92 |     fmax_dtype_device = str(fmax) + '_' + dtype_device
 93 |     wnsize_dtype_device = str(win_size) + '_' + dtype_device
 94 |     if fmax_dtype_device not in mel_basis:
 95 |         mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax)
 96 |         mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to(dtype=y.dtype, device=y.device)
 97 |     if wnsize_dtype_device not in hann_window:
 98 |         hann_window[wnsize_dtype_device] = torch.hann_window(win_size).to(dtype=y.dtype, device=y.device)
 99 | 
100 |     y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect')
101 |     y = y.squeeze(1)
102 |     with autocast(enabled=False):
103 |         y=y.float()
104 |         spec = torch.stft(
105 |             y, 
106 |             n_fft, 
107 |             hop_length=hop_size, 
108 |             win_length=win_size, 
109 |             window=hann_window[wnsize_dtype_device],
110 |             center=center, 
111 |             pad_mode='reflect', 
112 |             normalized=False, 
113 |             onesided=True,
114 |             return_complex=False
115 |         )
116 | 
117 |     spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6)
118 | 
119 |     spec = torch.matmul(mel_basis[fmax_dtype_device], spec)
120 |     spec = spectral_normalize_torch(spec)
121 | 
122 |     return spec
123 | 


--------------------------------------------------------------------------------
/metadata_cleaners.py:
--------------------------------------------------------------------------------
 1 | import argparse
 2 | import os
 3 | 
 4 | from text import cleaners
 5 | 
 6 | 
 7 | def main(args):
 8 |     cleaner = getattr(cleaners, args.cleaner)
 9 |     if not cleaner:
10 |         raise Exception('Unknown cleaner: %s' % args.cleaner)
11 | 
12 |     train_path = args.train_path
13 |     train_metadata = list()
14 | 
15 |     val_path = args.validation_path
16 |     val_metadata = list()
17 | 
18 |     with open(train_path, "r", encoding="utf-8") as f:
19 |         for line in f:
20 |             wav_path, text, speaker = line.strip("\n").split("|")
21 |             meta = [wav_path, cleaner(text), speaker]
22 |             train_metadata.append(meta)
23 | 
24 |     path, ext = os.path.splitext(train_path)
25 |     with open(path + "_cleaned" + ext, "w", encoding="utf-8") as f:
26 |         for meta in train_metadata:
27 |             f.write("|".join(meta) + "\n")
28 | 
29 |     if val_path is not None:
30 |         with open(val_path, "r", encoding="utf-8") as f:
31 |             for line in f:
32 |                 wav_path, text, speaker = line.strip("\n").split("|")
33 |                 meta = [wav_path, cleaner(text), speaker]
34 |                 val_metadata.append(meta)
35 | 
36 |         path, ext = os.path.splitext(val_path)
37 |         with open(path + "_cleaned" + ext, "w", encoding="utf-8") as f:
38 |             for meta in val_metadata:
39 |                 f.write("|".join(meta) + "\n")
40 | 
41 | 
42 | if __name__ == "__main__":
43 |     parser = argparse.ArgumentParser()
44 |     parser.add_argument(
45 |         "-t",
46 |         "--train_path",
47 |         type=str,
48 |         required=True,
49 |         help="path to train meatadata",
50 |     )
51 |     parser.add_argument(
52 |         "-v", 
53 |         "--validation_path", 
54 |         type=str, 
55 |         default=None,
56 |         help="path to validation meatadata",
57 |     )
58 |     parser.add_argument(
59 |         "-c", 
60 |         "--cleaner", 
61 |         type=str, 
62 |         default='korean_cleaners',
63 |         help="cleaner for text cleaning",
64 |     )
65 | 
66 |     args = parser.parse_args()
67 |     main(args)
68 | 


--------------------------------------------------------------------------------
/modules.py:
--------------------------------------------------------------------------------
  1 | # from https://github.com/jaywalnut310/vits
  2 | import math
  3 | import torch
  4 | from torch import nn
  5 | from torch.nn import Conv1d
  6 | from torch.nn import functional as F
  7 | from torch.nn.utils import weight_norm, remove_weight_norm
  8 | 
  9 | import commons
 10 | from commons import init_weights, get_padding
 11 | from transforms import piecewise_rational_quadratic_transform
 12 | 
 13 | 
 14 | LRELU_SLOPE = 0.1
 15 | 
 16 | 
 17 | class LayerNorm(nn.Module):
 18 |     def __init__(self, channels, eps=1e-5):
 19 |         super().__init__()
 20 |         self.channels = channels
 21 |         self.eps = eps
 22 | 
 23 |         self.gamma = nn.Parameter(torch.ones(channels))
 24 |         self.beta = nn.Parameter(torch.zeros(channels))
 25 | 
 26 |     def forward(self, x):
 27 |         x = x.transpose(1, -1)
 28 |         x = F.layer_norm(x, (self.channels,), self.gamma, self.beta, self.eps)
 29 |         return x.transpose(1, -1)
 30 | 
 31 | 
 32 | class ConvReluNorm(nn.Module):
 33 |     def __init__(self, in_channels, hidden_channels, out_channels, kernel_size, n_layers, p_dropout):
 34 |         super().__init__()
 35 |         self.in_channels = in_channels
 36 |         self.hidden_channels = hidden_channels
 37 |         self.out_channels = out_channels
 38 |         self.kernel_size = kernel_size
 39 |         self.n_layers = n_layers
 40 |         self.p_dropout = p_dropout
 41 |         assert n_layers > 1, "Number of layers should be larger than 0."
 42 | 
 43 |         self.conv_layers = nn.ModuleList()
 44 |         self.norm_layers = nn.ModuleList()
 45 |         self.conv_layers.append(
 46 |             nn.Conv1d(in_channels, hidden_channels, kernel_size, padding=kernel_size//2)
 47 |         )
 48 |         self.norm_layers.append(LayerNorm(hidden_channels))
 49 |         self.relu_drop = nn.Sequential(
 50 |             nn.ReLU(),
 51 |             nn.Dropout(p_dropout))
 52 |         for _ in range(n_layers-1):
 53 |             self.conv_layers.append(nn.Conv1d(
 54 |                 hidden_channels, hidden_channels, kernel_size, padding=kernel_size//2)
 55 |             )
 56 |             self.norm_layers.append(LayerNorm(hidden_channels))
 57 |         self.proj = nn.Conv1d(hidden_channels, out_channels, 1)
 58 |         self.proj.weight.data.zero_()
 59 |         self.proj.bias.data.zero_()
 60 | 
 61 |     def forward(self, x, x_mask):
 62 |         x_org = x
 63 |         for i in range(self.n_layers):
 64 |             x = self.conv_layers[i](x * x_mask)
 65 |             x = self.norm_layers[i](x)
 66 |             x = self.relu_drop(x)
 67 |         x = x_org + self.proj(x)
 68 |         return x * x_mask
 69 | 
 70 | 
 71 | class DDSConv(nn.Module):
 72 |     """Dialted and Depth-Separable Convolution"""
 73 |     def __init__(self, channels, kernel_size, n_layers, p_dropout=0.):
 74 |         super().__init__()
 75 |         self.channels = channels
 76 |         self.kernel_size = kernel_size
 77 |         self.n_layers = n_layers
 78 |         self.p_dropout = p_dropout
 79 | 
 80 |         self.drop = nn.Dropout(p_dropout)
 81 |         self.convs_sep = nn.ModuleList()
 82 |         self.convs_1x1 = nn.ModuleList()
 83 |         self.norms_1 = nn.ModuleList()
 84 |         self.norms_2 = nn.ModuleList()
 85 |         for i in range(n_layers):
 86 |             dilation = kernel_size ** i
 87 |             padding = (kernel_size * dilation - dilation) // 2
 88 |             self.convs_sep.append(
 89 |                 nn.Conv1d(
 90 |                     channels, 
 91 |                     channels, 
 92 |                     kernel_size,
 93 |                     groups=channels, 
 94 |                     dilation=dilation, 
 95 |                     padding=padding
 96 |                 )
 97 |             )
 98 |             self.convs_1x1.append(nn.Conv1d(channels, channels, 1))
 99 |             self.norms_1.append(LayerNorm(channels))
100 |             self.norms_2.append(LayerNorm(channels))
101 | 
102 |     def forward(self, x, x_mask, g=None):
103 |         if g is not None:
104 |             x = x + g
105 |         for i in range(self.n_layers):
106 |             y = self.convs_sep[i](x * x_mask)
107 |             y = self.norms_1[i](y)
108 |             y = F.gelu(y)
109 |             y = self.convs_1x1[i](y)
110 |             y = self.norms_2[i](y)
111 |             y = F.gelu(y)
112 |             y = self.drop(y)
113 |             x = x + y
114 |         return x * x_mask
115 | 
116 | 
117 | class WN(torch.nn.Module):
118 |     def __init__(self, hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=0, p_dropout=0):
119 |         super(WN, self).__init__()
120 |         assert(kernel_size % 2 == 1)
121 |         self.hidden_channels = hidden_channels
122 |         self.kernel_size = kernel_size,
123 |         self.dilation_rate = dilation_rate
124 |         self.n_layers = n_layers
125 |         self.gin_channels = gin_channels
126 |         self.p_dropout = p_dropout
127 | 
128 |         self.in_layers = torch.nn.ModuleList()
129 |         self.res_skip_layers = torch.nn.ModuleList()
130 |         self.drop = nn.Dropout(p_dropout)
131 | 
132 |         if gin_channels != 0:
133 |             cond_layer = torch.nn.Conv1d(gin_channels, 2*hidden_channels*n_layers, 1)
134 |             self.cond_layer = torch.nn.utils.weight_norm(cond_layer, name='weight')
135 | 
136 |         for i in range(n_layers):
137 |             dilation = dilation_rate ** i
138 |             padding = int((kernel_size * dilation - dilation) / 2)
139 |             in_layer = torch.nn.Conv1d(
140 |                 hidden_channels, 
141 |                 2*hidden_channels, 
142 |                 kernel_size,
143 |                 dilation=dilation, 
144 |                 padding=padding
145 |             )
146 |             in_layer = torch.nn.utils.weight_norm(in_layer, name='weight')
147 |             self.in_layers.append(in_layer)
148 | 
149 |             # last one is not necessary
150 |             if i < n_layers - 1:
151 |                 res_skip_channels = 2 * hidden_channels
152 |             else:
153 |                 res_skip_channels = hidden_channels
154 | 
155 |             res_skip_layer = torch.nn.Conv1d(
156 |                 hidden_channels, res_skip_channels, 1
157 |             )
158 |             res_skip_layer = torch.nn.utils.weight_norm(
159 |                 res_skip_layer, name='weight'
160 |             )
161 |             self.res_skip_layers.append(res_skip_layer)
162 | 
163 |     def forward(self, x, x_mask, g=None, **kwargs):
164 |         output = torch.zeros_like(x)
165 |         n_channels_tensor = torch.IntTensor([self.hidden_channels])
166 | 
167 |         if g is not None:
168 |             g = self.cond_layer(g)
169 | 
170 |         for i in range(self.n_layers):
171 |             x_in = self.in_layers[i](x)
172 |             if g is not None:
173 |                 cond_offset = i * 2 * self.hidden_channels
174 |                 g_l = g[:, cond_offset:cond_offset+2*self.hidden_channels, :]
175 |             else:
176 |                 g_l = torch.zeros_like(x_in)
177 | 
178 |             acts = commons.fused_add_tanh_sigmoid_multiply(
179 |                 x_in,
180 |                 g_l,
181 |                 n_channels_tensor
182 |             )
183 |             acts = self.drop(acts)
184 | 
185 |             res_skip_acts = self.res_skip_layers[i](acts)
186 |             if i < self.n_layers - 1:
187 |                 res_acts = res_skip_acts[:, :self.hidden_channels, :]
188 |                 x = (x + res_acts) * x_mask
189 |                 output = output + res_skip_acts[:, self.hidden_channels:, :]
190 |             else:
191 |                 output = output + res_skip_acts
192 |         return output * x_mask
193 | 
194 |     def remove_weight_norm(self):
195 |         if self.gin_channels != 0:
196 |             torch.nn.utils.remove_weight_norm(self.cond_layer)
197 |         for l in self.in_layers:
198 |             torch.nn.utils.remove_weight_norm(l)
199 |         for l in self.res_skip_layers:
200 |             torch.nn.utils.remove_weight_norm(l)
201 | 
202 | 
203 | class ResBlock1(torch.nn.Module):
204 |     def __init__(self, channels, kernel_size=3, dilation=(1, 3, 5)):
205 |         super(ResBlock1, self).__init__()
206 |         self.convs1 = nn.ModuleList([
207 |             weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
208 |                 padding=get_padding(kernel_size, dilation[0]))),
209 |             weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
210 |                 padding=get_padding(kernel_size, dilation[1]))),
211 |             weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[2],
212 |                 padding=get_padding(kernel_size, dilation[2])))
213 |         ])
214 |         self.convs1.apply(init_weights)
215 | 
216 |         self.convs2 = nn.ModuleList([
217 |             weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
218 |                 padding=get_padding(kernel_size, 1))),
219 |             weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
220 |                 padding=get_padding(kernel_size, 1))),
221 |             weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
222 |                 padding=get_padding(kernel_size, 1)))
223 |         ])
224 |         self.convs2.apply(init_weights)
225 | 
226 |     def forward(self, x, x_mask=None):
227 |         for c1, c2 in zip(self.convs1, self.convs2):
228 |             xt = F.leaky_relu(x, LRELU_SLOPE)
229 |             if x_mask is not None:
230 |                 xt = xt * x_mask
231 |             xt = c1(xt)
232 |             xt = F.leaky_relu(xt, LRELU_SLOPE)
233 |             if x_mask is not None:
234 |                 xt = xt * x_mask
235 |             xt = c2(xt)
236 |             x = xt + x
237 |         if x_mask is not None:
238 |             x = x * x_mask
239 |         return x
240 | 
241 |     def remove_weight_norm(self):
242 |         for l in self.convs1:
243 |             remove_weight_norm(l)
244 |         for l in self.convs2:
245 |             remove_weight_norm(l)
246 | 
247 | 
248 | class ResBlock2(torch.nn.Module):
249 |     def __init__(self, channels, kernel_size=3, dilation=(1, 3)):
250 |         super(ResBlock2, self).__init__()
251 |         self.convs = nn.ModuleList([
252 |             weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
253 |                 padding=get_padding(kernel_size, dilation[0]))),
254 |             weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
255 |                 padding=get_padding(kernel_size, dilation[1])))
256 |         ])
257 |         self.convs.apply(init_weights)
258 | 
259 |     def forward(self, x, x_mask=None):
260 |         for c in self.convs:
261 |             xt = F.leaky_relu(x, LRELU_SLOPE)
262 |             if x_mask is not None:
263 |                 xt = xt * x_mask
264 |             xt = c(xt)
265 |             x = xt + x
266 |         if x_mask is not None:
267 |             x = x * x_mask
268 |         return x
269 | 
270 |     def remove_weight_norm(self):
271 |         for l in self.convs:
272 |             remove_weight_norm(l)
273 | 
274 | 
275 | class Log(nn.Module):
276 |     def forward(self, x, x_mask, reverse=False, **kwargs):
277 |         if not reverse:
278 |             y = torch.log(torch.clamp_min(x, 1e-5)) * x_mask
279 |             logdet = torch.sum(-y, [1, 2])
280 |             return y, logdet
281 |         else:
282 |             x = torch.exp(x) * x_mask
283 |             return x
284 | 
285 | 
286 | class Flip(nn.Module):
287 |     def forward(self, x, *args, reverse=False, **kwargs):
288 |         x = torch.flip(x, [1])
289 |         if not reverse:
290 |             logdet = torch.zeros(x.size(0)).to(dtype=x.dtype, device=x.device)
291 |             return x, logdet
292 |         else:
293 |             return x
294 | 
295 | 
296 | class ElementwiseAffine(nn.Module):
297 |     def __init__(self, channels):
298 |         super().__init__()
299 |         self.channels = channels
300 |         self.m = nn.Parameter(torch.zeros(channels, 1))
301 |         self.logs = nn.Parameter(torch.zeros(channels, 1))
302 | 
303 |     def forward(self, x, x_mask, reverse=False, **kwargs):
304 |         if not reverse:
305 |             y = self.m + torch.exp(self.logs) * x
306 |             y = y * x_mask
307 |             logdet = torch.sum(self.logs * x_mask, [1, 2])
308 |             return y, logdet
309 |         else:
310 |             x = (x - self.m) * torch.exp(-self.logs) * x_mask
311 |             return x
312 | 
313 | 
314 | class ResidualCouplingLayer(nn.Module):
315 |     def __init__(
316 |         self,
317 |         channels,
318 |         hidden_channels,
319 |         kernel_size,
320 |         dilation_rate,
321 |         n_layers,
322 |         p_dropout=0,
323 |         gin_channels=0,
324 |         mean_only=False
325 |     ):
326 |         assert channels % 2 == 0, "channels should be divisible by 2"
327 |         super().__init__()
328 |         self.channels = channels
329 |         self.hidden_channels = hidden_channels
330 |         self.kernel_size = kernel_size
331 |         self.dilation_rate = dilation_rate
332 |         self.n_layers = n_layers
333 |         self.half_channels = channels // 2
334 |         self.mean_only = mean_only
335 | 
336 |         self.pre = nn.Conv1d(self.half_channels, hidden_channels, 1)
337 |         self.enc = WN(
338 |             hidden_channels, 
339 |             kernel_size, 
340 |             dilation_rate, 
341 |             n_layers, 
342 |             p_dropout=p_dropout, 
343 |             gin_channels=gin_channels
344 |         )
345 |         self.post = nn.Conv1d(
346 |             hidden_channels, self.half_channels * (2 - mean_only), 1
347 |         )
348 |         self.post.weight.data.zero_()
349 |         self.post.bias.data.zero_()
350 | 
351 |     def forward(self, x, x_mask, g=None, reverse=False):
352 |         x0, x1 = torch.split(x, [self.half_channels] * 2, 1)
353 |         h = self.pre(x0) * x_mask
354 |         h = self.enc(h, x_mask, g=g)
355 |         stats = self.post(h) * x_mask
356 |         if not self.mean_only:
357 |             m, logs = torch.split(stats, [self.half_channels] * 2, 1)
358 |         else:
359 |             m = stats
360 |             logs = torch.zeros_like(m)
361 | 
362 |         if not reverse:
363 |             x1 = m + x1 * torch.exp(logs) * x_mask
364 |             x = torch.cat([x0, x1], 1)
365 |             logdet = torch.sum(logs, [1, 2])
366 |             return x, logdet
367 |         else:
368 |             x1 = (x1 - m) * torch.exp(-logs) * x_mask
369 |             x = torch.cat([x0, x1], 1)
370 |             return x
371 | 
372 | 
373 | class ConvFlow(nn.Module):
374 |     def __init__(self, in_channels, filter_channels, kernel_size, n_layers, num_bins=10, tail_bound=5.0):
375 |         super().__init__()
376 |         self.in_channels = in_channels
377 |         self.filter_channels = filter_channels
378 |         self.kernel_size = kernel_size
379 |         self.n_layers = n_layers
380 |         self.num_bins = num_bins
381 |         self.tail_bound = tail_bound
382 |         self.half_channels = in_channels // 2
383 | 
384 |         self.pre = nn.Conv1d(self.half_channels, filter_channels, 1)
385 |         self.convs = DDSConv(
386 |             filter_channels, kernel_size, n_layers, p_dropout=0.
387 |         )
388 |         self.proj = nn.Conv1d(
389 |             filter_channels, self.half_channels * (num_bins * 3 - 1), 1
390 |         )
391 |         self.proj.weight.data.zero_()
392 |         self.proj.bias.data.zero_()
393 | 
394 |     def forward(self, x, x_mask, g=None, reverse=False):
395 |         x0, x1 = torch.split(x, [self.half_channels]*2, 1)
396 |         h = self.pre(x0)
397 |         h = self.convs(h, x_mask, g=g)
398 |         h = self.proj(h) * x_mask
399 | 
400 |         b, c, t = x0.shape
401 |         # [b, cx?, t] -> [b, c, t, ?]
402 |         h = h.reshape(b, c, -1, t).permute(0, 1, 3, 2)
403 | 
404 |         unnormalized_widths = h[..., :self.num_bins] / \
405 |             math.sqrt(self.filter_channels)
406 |         unnormalized_heights = h[..., self.num_bins:2 * self.num_bins] / \
407 |             math.sqrt(self.filter_channels)
408 |         unnormalized_derivatives = h[..., 2 * self.num_bins:]
409 | 
410 |         x1, logabsdet = piecewise_rational_quadratic_transform(
411 |             x1,
412 |             unnormalized_widths,
413 |             unnormalized_heights,
414 |             unnormalized_derivatives,
415 |             inverse=reverse,
416 |             tails='linear',
417 |             tail_bound=self.tail_bound
418 |         )
419 | 
420 |         x = torch.cat([x0, x1], 1) * x_mask
421 |         logdet = torch.sum(logabsdet * x_mask, [1, 2])
422 |         if not reverse:
423 |             return x, logdet
424 |         else:
425 |             return x
426 | 


--------------------------------------------------------------------------------
/monotonic_align/__init__.py:
--------------------------------------------------------------------------------
 1 | import numpy as np
 2 | import torch
 3 | from .monotonic_align.core import maximum_path_c
 4 | 
 5 | 
 6 | def maximum_path(neg_cent, mask):
 7 |     """ Cython optimized version.
 8 |     neg_cent: [b, t_t, t_s]
 9 |     mask: [b, t_t, t_s]
10 |     """
11 |     device = neg_cent.device
12 |     dtype = neg_cent.dtype
13 |     neg_cent = neg_cent.data.cpu().numpy().astype(np.float32)
14 |     path = np.zeros(neg_cent.shape, dtype=np.int32)
15 | 
16 |     t_t_max = mask.sum(1)[:, 0].data.cpu().numpy().astype(np.int32)
17 |     t_s_max = mask.sum(2)[:, 0].data.cpu().numpy().astype(np.int32)
18 |     maximum_path_c(path, neg_cent, t_t_max, t_s_max)
19 |     return torch.from_numpy(path).to(device=device, dtype=dtype)
20 | 


--------------------------------------------------------------------------------
/monotonic_align/core.pyx:
--------------------------------------------------------------------------------
 1 | cimport cython
 2 | from cython.parallel import prange
 3 | 
 4 | 
 5 | @cython.boundscheck(False)
 6 | @cython.wraparound(False)
 7 | cdef void maximum_path_each(int[:,::1] path, float[:,::1] value, int t_y, int t_x, float max_neg_val=-1e9) nogil:
 8 |   cdef int x
 9 |   cdef int y
10 |   cdef float v_prev
11 |   cdef float v_cur
12 |   cdef float tmp
13 |   cdef int index = t_x - 1
14 | 
15 |   for y in range(t_y):
16 |     for x in range(max(0, t_x + y - t_y), min(t_x, y + 1)):
17 |       if x == y:
18 |         v_cur = max_neg_val
19 |       else:
20 |         v_cur = value[y-1, x]
21 |       if x == 0:
22 |         if y == 0:
23 |           v_prev = 0.
24 |         else:
25 |           v_prev = max_neg_val
26 |       else:
27 |         v_prev = value[y-1, x-1]
28 |       value[y, x] += max(v_prev, v_cur)
29 | 
30 |   for y in range(t_y - 1, -1, -1):
31 |     path[y, index] = 1
32 |     if index != 0 and (index == y or value[y-1, index] < value[y-1, index-1]):
33 |       index = index - 1
34 | 
35 | 
36 | @cython.boundscheck(False)
37 | @cython.wraparound(False)
38 | cpdef void maximum_path_c(int[:,:,::1] paths, float[:,:,::1] values, int[::1] t_ys, int[::1] t_xs) nogil:
39 |   cdef int b = paths.shape[0]
40 |   cdef int i
41 |   for i in prange(b, nogil=True):
42 |     maximum_path_each(paths[i], values[i], t_ys[i], t_xs[i])
43 | 


--------------------------------------------------------------------------------
/monotonic_align/setup.py:
--------------------------------------------------------------------------------
 1 | from distutils.core import setup
 2 | from Cython.Build import cythonize
 3 | import numpy
 4 | 
 5 | setup(
 6 |     name='monotonic_align',
 7 |     ext_modules=cythonize("core.pyx"),
 8 |     include_dirs=[numpy.get_include()]
 9 | )
10 | 


--------------------------------------------------------------------------------
/pqmf.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | 
  3 | # Copyright 2020 Tomoki Hayashi
  4 | #  MIT License (https://opensource.org/licenses/MIT)
  5 | 
  6 | """Pseudo QMF modules."""
  7 | '''
  8 | Copied from https://github.com/kan-bayashi/ParallelWaveGAN/blob/master/parallel_wavegan/layers/pqmf.py
  9 | '''
 10 | 
 11 | import numpy as np
 12 | import torch
 13 | import torch.nn.functional as F
 14 | 
 15 | from scipy.signal import kaiser
 16 | 
 17 | 
 18 | def design_prototype_filter(taps=62, cutoff_ratio=0.142, beta=9.0):
 19 |     """Design prototype filter for PQMF.
 20 |     This method is based on `A Kaiser window approach for the design of prototype
 21 |     filters of cosine modulated filterbanks`_.
 22 |     Args:
 23 |         taps (int): The number of filter taps.
 24 |         cutoff_ratio (float): Cut-off frequency ratio.
 25 |         beta (float): Beta coefficient for kaiser window.
 26 |     Returns:
 27 |         ndarray: Impluse response of prototype filter (taps + 1,).
 28 |     .. _`A Kaiser window approach for the design of prototype filters of cosine modulated filterbanks`:
 29 |         https://ieeexplore.ieee.org/abstract/document/681427
 30 |     """
 31 |     # check the arguments are valid
 32 |     assert taps % 2 == 0, "The number of taps mush be even number."
 33 |     assert 0.0 < cutoff_ratio < 1.0, "Cutoff ratio must be > 0.0 and < 1.0."
 34 | 
 35 |     # make initial filter
 36 |     omega_c = np.pi * cutoff_ratio
 37 |     with np.errstate(invalid="ignore"):
 38 |         h_i = np.sin(omega_c * (np.arange(taps + 1) - 0.5 * taps)) / (
 39 |             np.pi * (np.arange(taps + 1) - 0.5 * taps)
 40 |         )
 41 |     h_i[taps // 2] = np.cos(0) * cutoff_ratio  # fix nan due to indeterminate form
 42 | 
 43 |     # apply kaiser window
 44 |     w = kaiser(taps + 1, beta)
 45 |     h = h_i * w
 46 | 
 47 |     return h
 48 | 
 49 | 
 50 | class PQMF(torch.nn.Module):
 51 |     """PQMF module.
 52 |     This module is based on `Near-perfect-reconstruction pseudo-QMF banks`_.
 53 |     .. _`Near-perfect-reconstruction pseudo-QMF banks`:
 54 |         https://ieeexplore.ieee.org/document/258122
 55 |     """
 56 | 
 57 |     def __init__(self, subbands=4, taps=62, cutoff_ratio=0.142, beta=9.0):
 58 |         """Initilize PQMF module.
 59 |         The cutoff_ratio and beta parameters are optimized for #subbands = 4.
 60 |         See dicussion in https://github.com/kan-bayashi/ParallelWaveGAN/issues/195.
 61 |         Args:
 62 |             subbands (int): The number of subbands.
 63 |             taps (int): The number of filter taps.
 64 |             cutoff_ratio (float): Cut-off frequency ratio.
 65 |             beta (float): Beta coefficient for kaiser window.
 66 |         """
 67 |         super(PQMF, self).__init__()
 68 | 
 69 |         # build analysis & synthesis filter coefficients
 70 |         h_proto = design_prototype_filter(taps, cutoff_ratio, beta)
 71 |         h_analysis = np.zeros((subbands, len(h_proto)))
 72 |         h_synthesis = np.zeros((subbands, len(h_proto)))
 73 |         for k in range(subbands):
 74 |             h_analysis[k] = (
 75 |                 2
 76 |                 * h_proto
 77 |                 * np.cos(
 78 |                     (2 * k + 1)
 79 |                     * (np.pi / (2 * subbands))
 80 |                     * (np.arange(taps + 1) - (taps / 2))
 81 |                     + (-1) ** k * np.pi / 4
 82 |                 )
 83 |             )
 84 |             h_synthesis[k] = (
 85 |                 2
 86 |                 * h_proto
 87 |                 * np.cos(
 88 |                     (2 * k + 1)
 89 |                     * (np.pi / (2 * subbands))
 90 |                     * (np.arange(taps + 1) - (taps / 2))
 91 |                     - (-1) ** k * np.pi / 4
 92 |                 )
 93 |             )
 94 | 
 95 |         # convert to tensor
 96 |         analysis_filter = torch.Tensor(h_analysis).float().unsqueeze(1)
 97 |         synthesis_filter = torch.Tensor(h_synthesis).float().unsqueeze(0)
 98 | 
 99 |         # register coefficients as beffer
100 |         self.register_buffer("analysis_filter", analysis_filter)
101 |         self.register_buffer("synthesis_filter", synthesis_filter)
102 | 
103 |         # filter for downsampling & upsampling
104 |         updown_filter = torch.zeros((subbands, subbands, subbands)).float()
105 |         for k in range(subbands):
106 |             updown_filter[k, k, 0] = 1.0
107 |         self.register_buffer("updown_filter", updown_filter)
108 |         self.subbands = subbands
109 | 
110 |         # keep padding info
111 |         self.pad_fn = torch.nn.ConstantPad1d(taps // 2, 0.0)
112 | 
113 |     def analysis(self, x):
114 |         """Analysis with PQMF.
115 |         Args:
116 |             x (Tensor): Input tensor (B, 1, T).
117 |         Returns:
118 |             Tensor: Output tensor (B, subbands, T // subbands).
119 |         """
120 |         x = F.conv1d(self.pad_fn(x), self.analysis_filter)
121 |         return F.conv1d(x, self.updown_filter, stride=self.subbands)
122 | 
123 |     def synthesis(self, x):
124 |         """Synthesis with PQMF.
125 |         Args:
126 |             x (Tensor): Input tensor (B, subbands, T // subbands).
127 |         Returns:
128 |             Tensor: Output tensor (B, 1, T).
129 |         """
130 |         # NOTE(kan-bayashi): Power will be dreased so here multipy by # subbands.
131 |         #   Not sure this is the correct way, it is better to check again.
132 |         # TODO(kan-bayashi): Understand the reconstruction procedure
133 |         x = F.conv_transpose1d(
134 |             x, self.updown_filter * self.subbands, stride=self.subbands
135 |         )
136 |         return F.conv1d(self.pad_fn(x), self.synthesis_filter)
137 | 


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
 1 | Cython
 2 | matplotlib
 3 | tensorboard
 4 | kiwipiepy
 5 | librosa==0.8.0
 6 | numpy
 7 | scipy
 8 | Unidecode
 9 | omegaconf
10 | alias_free_torch
11 | phaseaug
12 | pyopenjtalk==0.2.0
13 | inflect
14 | phaseaug
15 | soundcard 
16 | soundfile
17 | 


--------------------------------------------------------------------------------
/text/__init__.py:
--------------------------------------------------------------------------------
  1 | """ from https://github.com/keithito/tacotron """
  2 | import re
  3 | from unicodedata import normalize
  4 | 
  5 | from text.cleaners import collapse_whitespace
  6 | from text.symbols import lang_to_dict, lang_to_dict_inverse
  7 | 
  8 | 
  9 | def text_to_sequence(raw_text, lang):
 10 |     '''Converts a string of text to a sequence of IDs corresponding to the symbols in the text.
 11 |     Args:
 12 |         text: string to convert to a sequence
 13 |         lang: language of the input text
 14 |     Returns:
 15 |         List of integers corresponding to the symbols in the text
 16 |     '''
 17 | 
 18 |     _symbol_to_id = lang_to_dict(lang)
 19 |     text = collapse_whitespace(raw_text)
 20 | 
 21 |     if lang == 'ko_KR':    
 22 |         text = normalize('NFKD', text)
 23 |         sequence = [_symbol_to_id[symbol] for symbol in text]
 24 |         tone = [0 for i in sequence]
 25 | 
 26 |     elif lang == 'en_US':
 27 |         _curly_re = re.compile(r'(.*?)\{(.+?)\}(.*)')
 28 |         sequence = []
 29 | 
 30 |         while len(text):
 31 |             m = _curly_re.match(text)
 32 | 
 33 |             if m is not None:
 34 |                 ar = m.group(1)
 35 |                 sequence += [_symbol_to_id[symbol] for symbol in ar]
 36 |                 ar = m.group(2)
 37 |                 sequence += [_symbol_to_id[symbol] for symbol in ar.split()]
 38 |                 text = m.group(3)
 39 |             else:
 40 |                 sequence += [_symbol_to_id[symbol] for symbol in text]
 41 |                 break
 42 | 
 43 |         tone = [0 for i in sequence]
 44 | 
 45 |     ### Add ###
 46 |     elif lang == 'pyopenjtalk_prosody':
 47 |         sequence = []
 48 |         phonomes = pyopenjtalk_g2p_prosody(text)                    # 文章⇒音素
 49 |         sequence += [_symbol_to_id[symbol] for symbol in phonomes]  # 音素⇒index番号
 50 |         tone = [0 for i in sequence]                                # 音素分だけtoneも追加(なにこれ?)
 51 |     ###########
 52 | 
 53 |     else:
 54 |         raise RuntimeError('Wrong type of lang')
 55 | 
 56 |     assert len(sequence) == len(tone)
 57 |     return sequence, tone
 58 | 
 59 | 
 60 | def sequence_to_text(sequence, lang):
 61 |     '''Converts a sequence of IDs back to a string'''
 62 |     _id_to_symbol = lang_to_dict_inverse(lang)
 63 |     result = ''
 64 |     for symbol_id in sequence:
 65 |         s = _id_to_symbol[symbol_id]
 66 |         result += s
 67 |     return result
 68 | 
 69 | 
 70 | def _clean_text(text, cleaner_names):
 71 |     for name in cleaner_names:
 72 |         cleaner = getattr(cleaners, name)
 73 |         if not cleaner:
 74 |             raise Exception('Unknown cleaner: %s' % name)
 75 |         text = cleaner(text)
 76 |     return text
 77 | 
 78 | ### Add from espnet ### 
 79 | # ESPNet:https://github.com/espnet/espnet
 80 | #######################
 81 | def pyopenjtalk_g2p_prosody(text: str, drop_unvoiced_vowels: bool = True) :
 82 |     """Extract phoneme + prosoody symbol sequence from input full-context labels.
 83 | 
 84 |     The algorithm is based on `Prosodic features control by symbols as input of
 85 |     sequence-to-sequence acoustic modeling for neural TTS`_ with some r9y9's tweaks.
 86 | 
 87 |     Args:
 88 |         text (str): Input text.
 89 |         drop_unvoiced_vowels (bool): whether to drop unvoiced vowels.
 90 | 
 91 |     Returns:
 92 |         List[str]: List of phoneme + prosody symbols.
 93 | 
 94 |     Examples:
 95 |         >>> from espnet2.text.phoneme_tokenizer import pyopenjtalk_g2p_prosody
 96 |         >>> pyopenjtalk_g2p_prosody("こんにちは。")
 97 |         ['^', 'k', 'o', '[', 'N', 'n', 'i', 'ch', 'i', 'w', 'a', '$']
 98 | 
 99 |     .. _`Prosodic features control by symbols as input of sequence-to-sequence acoustic
100 |         modeling for neural TTS`: https://doi.org/10.1587/transinf.2020EDP7104
101 | 
102 |     """
103 |     labels = _extract_fullcontext_label(text)
104 |     N = len(labels)
105 | 
106 |     phones = []
107 |     for n in range(N):
108 |         lab_curr = labels[n]
109 | 
110 |         # current phoneme
111 |         p3 = re.search(r"\-(.*?)\+", lab_curr).group(1)
112 | 
113 |         # deal unvoiced vowels as normal vowels
114 |         if drop_unvoiced_vowels and p3 in "AEIOU":
115 |             p3 = p3.lower()
116 | 
117 |         # deal with sil at the beginning and the end of text
118 |         if p3 == "sil":
119 |             assert n == 0 or n == N - 1
120 |             if n == 0:
121 |                 phones.append("^")
122 |             elif n == N - 1:
123 |                 # check question form or not
124 |                 e3 = _numeric_feature_by_regex(r"!(\d+)_", lab_curr)
125 |                 if e3 == 0:
126 |                     phones.append("$")
127 |                 elif e3 == 1:
128 |                     phones.append("?")
129 |             continue
130 |         elif p3 == "pau":
131 |             phones.append("_")
132 |             continue
133 |         else:
134 |             phones.append(p3)
135 | 
136 |         # accent type and position info (forward or backward)
137 |         a1 = _numeric_feature_by_regex(r"/A:([0-9\-]+)\+", lab_curr)
138 |         a2 = _numeric_feature_by_regex(r"\+(\d+)\+", lab_curr)
139 |         a3 = _numeric_feature_by_regex(r"\+(\d+)/", lab_curr)
140 | 
141 |         # number of mora in accent phrase
142 |         f1 = _numeric_feature_by_regex(r"/F:(\d+)_", lab_curr)
143 | 
144 |         a2_next = _numeric_feature_by_regex(r"\+(\d+)\+", labels[n + 1])
145 |         # accent phrase border
146 |         if a3 == 1 and a2_next == 1 and p3 in "aeiouAEIOUNcl":
147 |             phones.append("#")
148 |         # pitch falling
149 |         elif a1 == 0 and a2_next == a2 + 1 and a2 != f1:
150 |             phones.append("]")
151 |         # pitch rising
152 |         elif a2 == 1 and a2_next == 2:
153 |             phones.append("[")
154 | 
155 |     return phones
156 | 
157 | from packaging.version import parse as V
158 | def _extract_fullcontext_label(text):
159 |     import pyopenjtalk
160 | 
161 |     if V(pyopenjtalk.__version__) >= V("0.3.0"):
162 |         return pyopenjtalk.make_label(pyopenjtalk.run_frontend(text))
163 |     else:
164 |         return pyopenjtalk.run_frontend(text)[1]
165 | 
166 | 
167 | def _numeric_feature_by_regex(regex, s):
168 |     match = re.search(regex, s)
169 |     if match is None:
170 |         return -50
171 |     return int(match.group(1))
172 | 
173 | #######################


--------------------------------------------------------------------------------
/text/cleaners.py:
--------------------------------------------------------------------------------
 1 | """ from https://github.com/keithito/tacotron """
 2 | 
 3 | '''
 4 | Cleaners are transformations that run over the input text at both training and eval time.
 5 | Cleaners can be selected by passing a comma-delimited list of cleaner names as the "cleaners"
 6 | hyperparameter. Some cleaners are English-specific. You'll typically want to use:
 7 |   1. "english_cleaners" for English text
 8 |   2. "transliteration_cleaners" for non-English text that can be transliterated to ASCII using
 9 |      the Unidecode library (https://pypi.python.org/pypi/Unidecode)
10 |   3. "basic_cleaners" if you do not want to transliterate (in this case, you should also update
11 |      the symbols in symbols.py to match your data).
12 | '''
13 | 
14 | import re
15 | from unidecode import unidecode
16 | from unicodedata import normalize
17 | 
18 | from .numbers import normalize_numbers
19 | 
20 | 
21 | # Regular expression matching whitespace:
22 | _whitespace_re = re.compile(r'\s+')
23 | 
24 | # List of (regular expression, replacement) pairs for abbreviations:
25 | _abbreviations = [(re.compile('\\b%s\\.' % x[0], re.IGNORECASE), x[1]) for x in [
26 |     ('mrs', 'misess'),
27 |     ('mr', 'mister'),
28 |     ('dr', 'doctor'),
29 |     ('st', 'saint'),
30 |     ('co', 'company'),
31 |     ('jr', 'junior'),
32 |     ('maj', 'major'),
33 |     ('gen', 'general'),
34 |     ('drs', 'doctors'),
35 |     ('rev', 'reverend'),
36 |     ('lt', 'lieutenant'),
37 |     ('hon', 'honorable'),
38 |     ('sgt', 'sergeant'),
39 |     ('capt', 'captain'),
40 |     ('esq', 'esquire'),
41 |     ('ltd', 'limited'),
42 |     ('col', 'colonel'),
43 |     ('ft', 'fort'),
44 | ]]
45 | 
46 | _cht_norm = [(re.compile(r'[%s]' % x[0]), x[1]) for x in [
47 |     ('。．；', '.'),
48 |     ('，、', ', '),
49 |     ('？', '?'),
50 |     ('！', '!'),
51 |     ('─‧', '-'),
52 |     ('…', '...'),
53 |     ('《》「」『』〈〉（）', "'"),
54 |     ('：︰', ':'),
55 |     ('　', ' ')
56 | ]]
57 | 
58 | def expand_abbreviations(text):
59 |     for regex, replacement in _abbreviations:
60 |         text = re.sub(regex, replacement, text)
61 |     return text
62 | 
63 | def expand_numbers(text):
64 |     return normalize_numbers(text)
65 | 
66 | def lowercase(text):
67 |     return text.lower()
68 | 
69 | def collapse_whitespace(text):
70 |     return re.sub(_whitespace_re, ' ', text)
71 | 
72 | def convert_to_ascii(text):
73 |     return unidecode(text)
74 | 
75 | def english_cleaners(text):
76 |     '''Pipeline for English text, including abbreviation expansion.'''
77 |     text = convert_to_ascii(text)
78 |     #text = lowercase(text)
79 |     text = expand_numbers(text)
80 |     text = expand_abbreviations(text)
81 |     text = collapse_whitespace(text)
82 |     return text
83 | 
84 | def korean_cleaners(text):
85 |     '''Pipeline for Korean text, including collapses whitespace.'''
86 |     text = collapse_whitespace(text)
87 |     text = normalize('NFKD', text)
88 |     return text
89 |    
90 | 


--------------------------------------------------------------------------------
/text/numbers.py:
--------------------------------------------------------------------------------
 1 | """ from https://github.com/keithito/tacotron """
 2 | 
 3 | import inflect
 4 | import re
 5 | 
 6 | 
 7 | _inflect = inflect.engine()
 8 | _comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])')
 9 | _decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)')
10 | _pounds_re = re.compile(r'£([0-9\,]*[0-9]+)')
11 | _dollars_re = re.compile(r'\$([0-9\.\,]*[0-9]+)')
12 | _ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)')
13 | _number_re = re.compile(r'[0-9]+')
14 | 
15 | 
16 | def _remove_commas(m):
17 |     return m.group(1).replace(',', '')
18 | 
19 | 
20 | def _expand_decimal_point(m):
21 |     return m.group(1).replace('.', ' point ')
22 | 
23 | 
24 | def _expand_dollars(m):
25 |     match = m.group(1)
26 |     parts = match.split('.')
27 |     if len(parts) > 2:
28 |         return match + ' dollars'  # Unexpected format
29 |     dollars = int(parts[0]) if parts[0] else 0
30 |     cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0
31 |     if dollars and cents:
32 |         dollar_unit = 'dollar' if dollars == 1 else 'dollars'
33 |         cent_unit = 'cent' if cents == 1 else 'cents'
34 |         return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit)
35 |     elif dollars:
36 |         dollar_unit = 'dollar' if dollars == 1 else 'dollars'
37 |         return '%s %s' % (dollars, dollar_unit)
38 |     elif cents:
39 |         cent_unit = 'cent' if cents == 1 else 'cents'
40 |         return '%s %s' % (cents, cent_unit)
41 |     else:
42 |         return 'zero dollars'
43 | 
44 | 
45 | def _expand_ordinal(m):
46 |     return _inflect.number_to_words(m.group(0))
47 | 
48 | 
49 | def _expand_number(m):
50 |     num = int(m.group(0))
51 |     if num > 1000 and num < 3000:
52 |         if num == 2000:
53 |             return 'two thousand'
54 |         elif num > 2000 and num < 2010:
55 |             return 'two thousand ' + _inflect.number_to_words(num % 100)
56 |         elif num % 100 == 0:
57 |             return _inflect.number_to_words(num // 100) + ' hundred'
58 |         else:
59 |             return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ')
60 |     else:
61 |         return _inflect.number_to_words(num, andword='')
62 | 
63 | 
64 | def normalize_numbers(text):
65 |     text = re.sub(_comma_number_re, _remove_commas, text)
66 |     text = re.sub(_pounds_re, r'\1 pounds', text)
67 |     text = re.sub(_dollars_re, _expand_dollars, text)
68 |     text = re.sub(_decimal_number_re, _expand_decimal_point, text)
69 |     text = re.sub(_ordinal_re, _expand_ordinal, text)
70 |     text = re.sub(_number_re, _expand_number, text)
71 |     return text


--------------------------------------------------------------------------------
/text/symbols.py:
--------------------------------------------------------------------------------
 1 | _pad = '_'
 2 | _punc = ";:,.!?¡¿—-…«»'“”~() "
 3 | 
 4 | _jamo_leads = "".join([chr(_) for _ in range(0x1100, 0x1113)])
 5 | _jamo_vowels = "".join([chr(_) for _ in range(0x1161, 0x1176)])
 6 | _jamo_tails = "".join([chr(_) for _ in range(0x11A8, 0x11C3)])
 7 | _kor_characters = _jamo_leads + _jamo_vowels + _jamo_tails
 8 | 
 9 | _cmu_characters = [
10 |     'AA', 'AE', 'AH',
11 |     'AO', 'AW', 'AY',
12 |     'B', 'CH', 'D', 'DH', 'EH', 'ER', 'EY',
13 |     'F', 'G', 'HH', 'IH', 'IY',
14 |     'JH', 'K', 'L', 'M', 'N', 'NG', 'OW', 'OY',
15 |     'P', 'R', 'S', 'SH', 'T', 'TH', 'UH', 'UW',
16 |     'V', 'W', 'Y', 'Z', 'ZH'
17 | ]
18 | 
19 | ### add Japanese phonomes by pyopenjtalk_g2p_prosody ###
20 | _ja_characters = [
21 |     '_' , '#' , '$' , '[' , '?' , ']' , '^' ,  
22 |     'a' , 'b' , 'by', 'ch', 'cl', 'd' , 'dy',
23 |     'e' , 'f' , 'g' , 'gy', 'h' , 'hy', 'i' ,
24 |     'j' , 'k' , 'ky', 'm' , 'my', 'n' , 'N' , 
25 |     'ny', 'o' , 'p' , 'py', 'r' , 'ry', 's' , 
26 |     'sh', 't' , 'ts', 'ty', 'u' , 'v'  , 'w', 
27 |     'y' , 'z' ]
28 | ########################################################
29 | 
30 | lang_to_symbols = {
31 |     'common': [_pad] + list(_punc),
32 |     'ko_KR': list(_kor_characters), 
33 |     'en_US': _cmu_characters, 
34 |     'pyopenjtalk_prosody': _ja_characters, 
35 | }
36 | 
37 | def lang_to_dict(lang):
38 | 
39 |     ### add ###
40 |     if lang == "pyopenjtalk_prosody":
41 |         symbol_lang = lang_to_symbols[lang]
42 |     else:
43 |         symbol_lang = lang_to_symbols['common'] + lang_to_symbols[lang]
44 |     ############
45 | 
46 |     dict_lang = {s: i for i, s in enumerate(symbol_lang)}
47 |     return dict_lang
48 | 
49 | def lang_to_dict_inverse(lang):
50 | 
51 |     ### add ###
52 |     if lang == "pyopenjtalk_prosody":
53 |         symbol_lang = lang_to_symbols[lang]
54 |     else:
55 |         symbol_lang = lang_to_symbols['common'] + lang_to_symbols[lang]
56 |     ###########
57 | 
58 |     dict_lang = {i: s for i, s in enumerate(symbol_lang)}
59 |     return dict_lang
60 | 
61 | def symbol_len(lang):
62 | 
63 |     ### add ###
64 |     if lang == "pyopenjtalk_prosody":
65 |         symbol_lang = lang_to_symbols[lang]
66 |     else:
67 |         symbol_lang = lang_to_symbols['common'] + lang_to_symbols[lang]
68 |     ###########
69 | 
70 |     return len(symbol_lang)
71 | 


--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
  1 | # modified from https://github.com/jaywalnut310/vits
  2 | import os
  3 | import argparse
  4 | import torch
  5 | from torch.nn import functional as F
  6 | from torch.utils.data import DataLoader
  7 | from torch.utils.tensorboard import SummaryWriter
  8 | import torch.multiprocessing as mp
  9 | import torch.distributed as dist
 10 | from torch.nn.parallel import DistributedDataParallel as DDP
 11 | from torch.cuda.amp import autocast, GradScaler
 12 | from tqdm import tqdm
 13 | from phaseaug.phaseaug import PhaseAug
 14 | import commons
 15 | import utils
 16 | from data_utils import (TextAudioSpeakerLoader, TextAudioSpeakerCollate,
 17 |                         DistributedBucketSampler, create_spec, dataset_check)
 18 | from models import (
 19 |     SynthesizerTrn,
 20 |     #    MultiPeriodDiscriminator,
 21 |     AvocodoDiscriminator)
 22 | from losses import (generator_loss, discriminator_loss, feature_loss, kl_loss)
 23 | from mel_processing import mel_spectrogram_torch, spec_to_mel_torch
 24 | from text.symbols import symbol_len
 25 | import math
 26 | 
 27 | torch.backends.cudnn.benchmark = True
 28 | global_step = 0
 29 | 
 30 | 
 31 | def main(args):
 32 |     """Assume Single Node Multi GPUs Training Only"""
 33 |     assert torch.cuda.is_available(), "CPU training is not allowed."
 34 | 
 35 |     n_gpus = torch.cuda.device_count()
 36 |     os.environ['MASTER_ADDR'] = 'localhost'
 37 |     os.environ['MASTER_PORT'] = '12345'     ### modify ###
 38 | 
 39 |     hps = utils.get_hparams(args)
 40 |     # create spectrogram files ### modify ### 
 41 |     #create_spec(hps.data.training_files, hps.data)
 42 |     #create_spec(hps.data.validation_files, hps.data)
 43 |     dataset_check(hps.data.validation_files, hps.data)
 44 |     dataset_check(hps.data.training_files, hps.data)
 45 |     
 46 |     run(rank=0, n_gpus=1, hps=hps, args=args)               ### modify ### 
 47 |     #mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps, args))
 48 | 
 49 | 
 50 | def count_parameters(model, scale=1000000):
 51 |     return sum(p.numel()
 52 |                for p in model.parameters() if p.requires_grad) / scale
 53 | 
 54 | 
 55 | def run(rank, n_gpus, hps, args):
 56 |     global global_step
 57 |     if rank == 0:
 58 |         logger = utils.get_logger(hps.model_dir)
 59 |         logger.info('MODEL NAME: {} in {}'.format(args.model, hps.model_dir))
 60 |         logger.info(
 61 |             'GPU: Use {} gpu(s) with batch size {} (FP16 running: {})'.format(
 62 |                 n_gpus, hps.train.batch_size, hps.train.fp16_run))
 63 | 
 64 |         utils.check_git_hash(hps.model_dir)
 65 |         writer = SummaryWriter(log_dir=hps.model_dir)
 66 | 
 67 |     dist.init_process_group(backend='nccl',
 68 |                             init_method='env://',
 69 |                             world_size=n_gpus,
 70 |                             rank=rank,
 71 |                             group_name=args.model)
 72 |     torch.manual_seed(hps.train.seed)
 73 |     torch.cuda.set_device(rank)
 74 | 
 75 |     use_persistent_workers = hps.data.persistent_workers
 76 |     use_pin_memory = not use_persistent_workers
 77 |     train_dataset = TextAudioSpeakerLoader(hps.data.training_files, hps.data,
 78 |                                            rank == 0 and args.initial_run)
 79 |     collate_fn = TextAudioSpeakerCollate()
 80 |     if rank == 0:
 81 |         eval_dataset = TextAudioSpeakerLoader(hps.data.validation_files,
 82 |                                               hps.data, rank == 0
 83 |                                               and args.initial_run)
 84 |         eval_loader = DataLoader(
 85 |             eval_dataset,
 86 |             num_workers=10,
 87 |             shuffle=False,
 88 |             batch_size=hps.train.batch_size,
 89 |             pin_memory=use_pin_memory,
 90 |             drop_last=False,
 91 |             collate_fn=collate_fn,
 92 |             persistent_workers=use_persistent_workers,
 93 |         )
 94 |     elif args.initial_run:
 95 |         print(f'rank: {rank} is waiting...')
 96 |     dist.barrier()
 97 |     if rank == 0:
 98 |         logger.info('Training Started')
 99 |     train_sampler = DistributedBucketSampler(
100 |         train_dataset,
101 |         hps.train.batch_size, [32, 300, 400, 500, 600, 700, 800, 900, 1000],
102 |         num_replicas=n_gpus,
103 |         rank=rank,
104 |         shuffle=True)
105 |     train_loader = DataLoader(train_dataset,
106 |                               num_workers=10,
107 |                               shuffle=False,
108 |                               pin_memory=use_pin_memory,
109 |                               collate_fn=collate_fn,
110 |                               persistent_workers=use_persistent_workers,
111 |                               batch_sampler=train_sampler)
112 | 
113 |     net_g = SynthesizerTrn(symbol_len(hps.data.languages),
114 |                            hps.data.filter_length // 2 + 1,
115 |                            hps.train.segment_size // hps.data.hop_length,
116 |                            n_speakers=len(hps.data.speakers),
117 |                            midi_start=hps.data.midi_start,
118 |                            midi_end=hps.data.midi_end,
119 |                            octave_range=hps.data.octave_range,
120 |                            **hps.model,
121 | 
122 |                            ### add ###
123 |                            sr=hps.data.sampling_rate,
124 |                            W=hps.data.ying_window,
125 |                            w_step=hps.data.ying_hop,
126 |                            tau_max=hps.data.tau_max
127 |                            ###########
128 | 
129 |                            ).cuda(rank)
130 |     net_d = AvocodoDiscriminator(hps.model.use_spectral_norm, hps.train.segment_size).cuda(rank)
131 |     if rank == 0:
132 |         logger.info('MODEL SIZE: G {:.2f}M and D {:.2f}M'.format(
133 |             count_parameters(net_g),
134 |             count_parameters(net_d),
135 |         ))
136 | 
137 |     optim_g = torch.optim.AdamW(net_g.parameters(),
138 |                                 hps.train.learning_rate,
139 |                                 betas=hps.train.betas,
140 |                                 eps=hps.train.eps)
141 |     optim_d = torch.optim.AdamW(net_d.parameters(),
142 |                                 hps.train.learning_rate,
143 |                                 betas=hps.train.betas,
144 |                                 eps=hps.train.eps)
145 |     #net_g = DDP(net_g, device_ids=[rank], find_unused_parameters=True)
146 |     #net_d = DDP(net_d, device_ids=[rank], find_unused_parameters=True)
147 |     net_g = DDP(net_g, device_ids=[rank], find_unused_parameters=False) ### modify ###
148 |     net_d = DDP(net_d, device_ids=[rank], find_unused_parameters=False) ### modify ###
149 | 
150 |     if args.transfer:
151 |         _, _, _, _, _, _, _ = utils.load_checkpoint(args.transfer, rank, net_g,
152 |                                                     net_d, None, None)
153 |         epoch_str = 1
154 |         global_step = 0
155 | 
156 |     elif args.force_resume:
157 |         _, _, _, epoch_save, _ = utils.load_checkpoint_diffsize(
158 |             args.force_resume, rank, net_g, net_d)
159 |         epoch_str = epoch_save + 1
160 |         global_step = epoch_save * len(train_loader) + 1
161 |         scheduler_g = torch.optim.lr_scheduler.ExponentialLR(
162 |             optim_g, gamma=hps.train.lr_decay, last_epoch=-1)
163 |         scheduler_d = torch.optim.lr_scheduler.ExponentialLR(
164 |             optim_d, gamma=hps.train.lr_decay, last_epoch=-1)
165 |     elif args.resume:
166 |         _, _, _, _, _, epoch_save, _ = utils.load_checkpoint(
167 |             args.resume, rank, net_g, net_d, optim_g, optim_d)
168 |         epoch_str = epoch_save + 1
169 |         global_step = epoch_save * len(train_loader) + 1
170 |     else:
171 |         epoch_str = 1
172 |         global_step = 0
173 | 
174 |     scheduler_g = torch.optim.lr_scheduler.ExponentialLR(
175 |         optim_g, gamma=hps.train.lr_decay, last_epoch=epoch_str - 2)
176 |     scheduler_d = torch.optim.lr_scheduler.ExponentialLR(
177 |         optim_d, gamma=hps.train.lr_decay, last_epoch=epoch_str - 2)
178 | 
179 |     scaler = GradScaler(enabled=hps.train.fp16_run)
180 |     if rank == 0:
181 |         outer_bar = tqdm(total=hps.train.epochs,
182 |                          desc="Training",
183 |                          position=0,
184 |                          leave=False)
185 |         outer_bar.update(epoch_str)
186 | 
187 |     for epoch in range(epoch_str, hps.train.epochs + 1):
188 |         if rank == 0:
189 |             train_and_evaluate(rank, epoch, hps, [net_g, net_d],
190 |                                [optim_g, optim_d], [scheduler_g, scheduler_d],
191 |                                scaler, [train_loader, eval_loader], writer)
192 |         else:
193 |             train_and_evaluate(rank, epoch, hps, [net_g, net_d],
194 |                                [optim_g, optim_d], [scheduler_g, scheduler_d],
195 |                                scaler, [train_loader, None], None)
196 |         scheduler_g.step()
197 |         scheduler_d.step()
198 |         if rank == 0:
199 |             outer_bar.update(1)
200 | 
201 | 
202 | def train_and_evaluate(rank, epoch, hps, nets, optims, schedulers, scaler,
203 |                        loaders, writer):
204 |     net_g, net_d = nets
205 |     optim_g, optim_d = optims
206 |     scheduler_g, scheduler_d = schedulers
207 |     train_loader, eval_loader = loaders
208 |     aug = PhaseAug().cuda(rank)
209 |     train_loader.batch_sampler.set_epoch(epoch)
210 |     global global_step
211 | 
212 |     net_g.train()
213 |     net_d.train()
214 |     if rank == 0:
215 |         inner_bar = tqdm(total=len(train_loader),
216 |                          desc="Epoch {}".format(epoch),
217 |                          position=1,
218 |                          leave=False)
219 | 
220 |     for batch_idx, (x, x_lengths, spec, spec_lengths, ying, ying_lengths, y,
221 |                     y_lengths, speakers, tone) in enumerate(train_loader):
222 |         x, x_lengths = x.cuda(rank, non_blocking=True), x_lengths.cuda(
223 |             rank, non_blocking=True)
224 |         spec, spec_lengths = spec.cuda(
225 |             rank, non_blocking=True), spec_lengths.cuda(rank,
226 |                                                         non_blocking=True)
227 |         ying, ying_lengths = ying.cuda(
228 |             rank, non_blocking=True), ying_lengths.cuda(rank,
229 |                                                         non_blocking=True)
230 | 
231 |         y, y_lengths = y.cuda(rank, non_blocking=True), y_lengths.cuda(
232 |             rank, non_blocking=True)
233 |         speakers = speakers.cuda(rank, non_blocking=True)
234 |         tone = tone.cuda(rank, non_blocking=True)
235 | 
236 |         with autocast(enabled=hps.train.fp16_run):
237 |             y_hat, l_length, attn, ids_slice, x_mask, z_mask, y_hat_, \
238 |             (z, z_p, m_p, logs_p, m_q, logs_q), _, \
239 |                 (z_spec, m_spec, logs_spec, spec_mask, z_yin, m_yin, logs_yin, yin_mask), \
240 |                 (yin_gt_crop, yin_gt_shifted_crop, yin_dec_crop, yin_hat_crop, scope_shift, yin_hat_shifted) \
241 |                  = net_g(
242 |                     x, tone, x_lengths, spec, spec_lengths, ying, ying_lengths, speakers
243 |                 )
244 |             mel = spec_to_mel_torch(spec, hps.data.filter_length,
245 |                                     hps.data.n_mel_channels,
246 |                                     hps.data.sampling_rate, hps.data.mel_fmin,
247 |                                     hps.data.mel_fmax)
248 |             y_mel = commons.slice_segments(
249 |                 mel, ids_slice, hps.train.segment_size // hps.data.hop_length)
250 |             y_hat_mel = mel_spectrogram_torch(
251 |                 y_hat[-1].squeeze(1), hps.data.filter_length,
252 |                 hps.data.n_mel_channels, hps.data.sampling_rate,
253 |                 hps.data.hop_length, hps.data.win_length, hps.data.mel_fmin,
254 |                 hps.data.mel_fmax)
255 |             yin_gt_crop = commons.slice_segments(
256 |                 torch.cat([yin_gt_crop, yin_gt_shifted_crop], dim=0),
257 |                 ids_slice, hps.train.segment_size // hps.data.hop_length)
258 | 
259 |             y_ = commons.slice_segments(torch.cat([y, y], dim=0),
260 |                                         ids_slice * hps.data.hop_length,
261 |                                         hps.train.segment_size)  # sliced
262 |             # Discriminator
263 |             with autocast(enabled=False):
264 |                 aug_y_, aug_y_hat_last = aug.forward_sync(
265 |                     y_, y_hat_[-1].detach())
266 |                 aug_y_hat_ = [_y.detach() for _y in y_hat_[:-1]]
267 |                 aug_y_hat_.append(aug_y_hat_last)
268 |             y_d_hat_r, y_d_hat_g, _, _ = net_d(aug_y_, aug_y_hat_)
269 |             with autocast(enabled=False):
270 |                 loss_disc, losses_disc_r, losses_disc_g = discriminator_loss(
271 |                     y_d_hat_r, y_d_hat_g)
272 |                 loss_disc_all = loss_disc
273 | 
274 |         optim_d.zero_grad()
275 |         scaler.scale(loss_disc_all).backward()
276 |         scaler.unscale_(optim_d)
277 |         grad_norm_d = commons.clip_grad_value_(net_d.parameters(), None)
278 |         scaler.step(optim_d)
279 | 
280 |         p = float(batch_idx + epoch *
281 |                   len(train_loader)) / hps.train.alpha / len(train_loader)
282 |         alpha = 2. / (1. + math.exp(-20 * p)) - 1
283 | 
284 |         with autocast(enabled=hps.train.fp16_run):
285 |             # Generator
286 |             with autocast(enabled=False):
287 |                 aug_y_, aug_y_hat_last = aug.forward_sync(y_, y_hat_[-1])
288 |                 aug_y_hat_ = y_hat_
289 |                 aug_y_hat_[-1] = aug_y_hat_last
290 |             y_d_hat_r, y_d_hat_g, fmap_r, fmap_g = net_d(aug_y_, aug_y_hat_)
291 |             with autocast(enabled=False):
292 |                 loss_dur = torch.sum(l_length.float())
293 |                 loss_mel = F.l1_loss(y_mel, y_hat_mel) * hps.train.c_mel
294 |                 loss_kl = kl_loss(z_p, logs_q, m_p, logs_p,
295 |                                   z_mask) * hps.train.c_kl
296 |                 loss_yin_dec = F.l1_loss(yin_gt_shifted_crop,
297 |                                          yin_dec_crop) * hps.train.c_yin
298 |                 
299 |                 ### add ### for frame check
300 |                 # _, _, gt_length = yin_gt_crop.shape
301 |                 # _, _, hat_length = yin_hat_crop.shape
302 |                 # if hat_length < gt_length:
303 |                 #     yin_gt_crop = yin_gt_crop[:,:,:hat_length]
304 |                 #     print(f"yin_gt : {gt_length} | yin_hat : {hat_length}")
305 |                 # elif hat_length > gt_length:
306 |                 #     yin_hat_crop = yin_hat_crop[:,:,:gt_length]
307 |                 #     print(f"yin_gt : {gt_length} | yin_hat : {hat_length}")
308 |                 ###########
309 |                 
310 |                 loss_yin_shift = F.l1_loss(
311 |                     torch.exp(-yin_gt_crop),
312 |                     torch.exp(-yin_hat_crop)) * hps.train.c_yin + F.l1_loss(
313 |                         torch.exp(-yin_hat_shifted),
314 |                         torch.exp(-(torch.chunk(yin_hat_crop, 2, dim=0)[1]))
315 |                     ) * hps.train.c_yin
316 |                 loss_fm = feature_loss(fmap_r, fmap_g)
317 |                 loss_gen, losses_gen = generator_loss(y_d_hat_g)
318 |                 loss_gen_all = loss_gen + loss_fm + loss_mel + loss_dur + loss_kl + loss_yin_shift + loss_yin_dec
319 |         optim_g.zero_grad()
320 |         scaler.scale(loss_gen_all).backward()
321 |         scaler.unscale_(optim_g)
322 |         grad_norm_g = commons.clip_grad_value_(net_g.parameters(), None)
323 |         scaler.step(optim_g)
324 |         scaler.update()
325 | 
326 |         if rank == 0:
327 |             inner_bar.update(1)
328 |             inner_bar.set_description(
329 |                 "Epoch {} | g {: .04f} d {: .04f}|".format(
330 |                     epoch, loss_gen_all, loss_disc_all))
331 |             if global_step % hps.train.log_interval == 0:
332 |                 lr = optim_g.param_groups[0]['lr']
333 | 
334 |                 scalar_dict = {
335 |                     "learning_rate": lr,
336 |                     "loss/g/score": sum(losses_gen),
337 |                     "loss/g/fm": loss_fm,
338 |                     "loss/g/mel": loss_mel,
339 |                     "loss/g/dur": loss_dur,
340 |                     "loss/g/kl": loss_kl,
341 |                     "loss/g/yindec": loss_yin_dec,
342 |                     "loss/g/yinshift": loss_yin_shift,
343 |                     "loss/g/total": loss_gen_all,
344 |                     "loss/d/real": sum(losses_disc_r),
345 |                     "loss/d/gen": sum(losses_disc_g),
346 |                     "loss/d/total": loss_disc_all,
347 |                 }
348 | 
349 |                 utils.summarize(writer=writer,
350 |                                 global_step=global_step,
351 |                                 scalars=scalar_dict)
352 |             if global_step % hps.train.eval_interval == 0:
353 |                 evaluate(hps, global_step, epoch, net_g, eval_loader, writer)
354 | 
355 |         global_step += 1
356 | 
357 |     if rank == 0:
358 |         if epoch % hps.train.save_interval == 0:
359 |             utils.save_checkpoint(
360 |                 net_g, optim_g, net_d, optim_d, hps, epoch,
361 |                 hps.train.learning_rate,
362 |                 os.path.join(hps.model_dir,
363 |                              "{}_{}.pth".format(hps.model_name, epoch)))
364 | 
365 | 
366 | def evaluate(hps, current_step, epoch, generator, eval_loader, writer):
367 |     generator.eval()
368 |     n_sample = hps.train.n_sample
369 |     with torch.no_grad():
370 |         loss_val_mel = 0
371 |         loss_val_yin = 0
372 |         val_bar = tqdm(total=len(eval_loader),
373 |                        desc="Validation (Step {})".format(current_step),
374 |                        position=1,
375 |                        leave=False)
376 |         for batch_idx, (x, x_lengths, spec, spec_lengths, ying, ying_lengths,
377 |                         y, y_lengths, speakers,
378 |                         tone) in enumerate(eval_loader):
379 |             x, x_lengths = x.cuda(0, non_blocking=True), x_lengths.cuda(
380 |                 0, non_blocking=True)
381 |             spec, spec_lengths = spec.cuda(
382 |                 0, non_blocking=True), spec_lengths.cuda(0, non_blocking=True)
383 |             ying, ying_lengths = ying.cuda(
384 |                 0, non_blocking=True), ying_lengths.cuda(0, non_blocking=True)
385 |             y, y_lengths = y.cuda(0, non_blocking=True), y_lengths.cuda(
386 |                 0, non_blocking=True)
387 |             speakers = speakers.cuda(0, non_blocking=True)
388 |             tone = tone.cuda(0, non_blocking=True)
389 | 
390 |             with autocast(enabled=hps.train.fp16_run):
391 |                 y_hat, l_length, attn, ids_slice, x_mask, z_mask, y_hat_, \
392 |                     (z, z_p, m_p, logs_p, m_q, logs_q),\
393 |                     _,\
394 |                     (z_spec, m_spec, logs_spec, spec_mask, z_yin, m_yin, logs_yin, yin_mask), \
395 |                     (yin_gt_crop, yin_gt_shifted_crop, yin_dec_crop, yin_hat_crop, scope_shift, yin_hat_shifted) \
396 |                     = generator.module(
397 |                         x, tone, x_lengths, spec, spec_lengths, ying, ying_lengths, speakers
398 |                     )
399 | 
400 |                 mel = spec_to_mel_torch(spec, hps.data.filter_length,
401 |                                         hps.data.n_mel_channels,
402 |                                         hps.data.sampling_rate,
403 |                                         hps.data.mel_fmin, hps.data.mel_fmax)
404 |                 y_mel = commons.slice_segments(
405 |                     mel, ids_slice,
406 |                     hps.train.segment_size // hps.data.hop_length)
407 |                 y_hat_mel = mel_spectrogram_torch(
408 |                     y_hat[-1].squeeze(1), hps.data.filter_length,
409 |                     hps.data.n_mel_channels, hps.data.sampling_rate,
410 |                     hps.data.hop_length, hps.data.win_length,
411 |                     hps.data.mel_fmin, hps.data.mel_fmax)
412 |                 with autocast(enabled=False):
413 |                     loss_mel = F.l1_loss(y_mel, y_hat_mel) * hps.train.c_mel
414 |                     loss_val_mel += loss_mel.item()
415 |                     loss_yin = F.l1_loss(yin_gt_shifted_crop,
416 |                                          yin_dec_crop) * hps.train.c_yin
417 |                     loss_val_yin += loss_yin.item()
418 | 
419 |             if batch_idx == 0:
420 |                 x = x[:n_sample]
421 |                 x_lengths = x_lengths[:n_sample]
422 |                 spec = spec[:n_sample]
423 |                 spec_lengths = spec_lengths[:n_sample]
424 |                 ying = ying[:n_sample]
425 |                 ying_lengths = ying_lengths[:n_sample]
426 |                 y = y[:n_sample]
427 |                 y_lengths = y_lengths[:n_sample]
428 |                 speakers = speakers[:n_sample]
429 |                 tone = tone[:1]
430 | 
431 |                 decoder_inputs, _, mask, (z_crop, z, *_) \
432 |                     = generator.module.infer_pre_decoder(x, tone, x_lengths, speakers, max_len=2000)
433 |                 y_hat = generator.module.infer_decode_chunk(
434 |                     decoder_inputs, speakers)
435 | 
436 |                 #scope-shifted
437 |                 z_spec, z_yin = torch.split(z,
438 |                                             hps.model.inter_channels -
439 |                                             hps.model.yin_channels,
440 |                                             dim=1)
441 |                 z_yin_crop = generator.module.crop_scope([z_yin], 6)[0]
442 |                 z_crop_shift = torch.cat([z_spec, z_yin_crop], dim=1)
443 |                 decoder_inputs_shift = z_crop_shift * mask
444 |                 y_hat_shift = generator.module.infer_decode_chunk(
445 |                     decoder_inputs_shift, speakers)
446 |                 z_yin = z_yin * mask
447 |                 yin_hat = generator.module.yin_dec_infer(z_yin, mask, speakers)
448 | 
449 |                 y_hat_mel_length = mask.sum([1, 2]).long()
450 |                 y_hat_lengths = y_hat_mel_length * hps.data.hop_length
451 | 
452 |                 mel = spec_to_mel_torch(spec, hps.data.filter_length,
453 |                                         hps.data.n_mel_channels,
454 |                                         hps.data.sampling_rate,
455 |                                         hps.data.mel_fmin, hps.data.mel_fmax)
456 |                 y_hat_mel = mel_spectrogram_torch(
457 |                     y_hat.squeeze(1).float(), hps.data.filter_length,
458 |                     hps.data.n_mel_channels, hps.data.sampling_rate,
459 |                     hps.data.hop_length, hps.data.win_length,
460 |                     hps.data.mel_fmin, hps.data.mel_fmax)
461 |                 y_hat_shift_mel = mel_spectrogram_torch(
462 |                     y_hat_shift.squeeze(1).float(), hps.data.filter_length,
463 |                     hps.data.n_mel_channels, hps.data.sampling_rate,
464 |                     hps.data.hop_length, hps.data.win_length,
465 |                     hps.data.mel_fmin, hps.data.mel_fmax)
466 |                 y_hat_pad = F.pad(
467 |                     y_hat, (hps.data.filter_length - hps.data.hop_length,
468 |                             hps.data.filter_length - hps.data.hop_length +
469 |                             (-y_hat.shape[-1]) % hps.data.hop_length +
470 |                             hps.data.hop_length *
471 |                             (y_hat.shape[-1] % hps.data.hop_length == 0)),
472 |                     mode='reflect').squeeze(1)
473 |                 y_hat_shift_pad = F.pad(
474 |                     y_hat_shift,
475 |                     (hps.data.filter_length - hps.data.hop_length,
476 |                      hps.data.filter_length - hps.data.hop_length +
477 |                      (-y_hat.shape[-1]) % hps.data.hop_length +
478 |                      hps.data.hop_length *
479 |                      (y_hat.shape[-1] % hps.data.hop_length == 0)),
480 |                     mode='reflect').squeeze(1)
481 |                 ying_hat = generator.module.pitch.yingram(y_hat_pad)
482 |                 ying_hat_shift = generator.module.pitch.yingram(
483 |                     y_hat_shift_pad)
484 | 
485 |                 if y_hat_mel.size(2) < mel.size(2):
486 |                     zero = torch.full((n_sample, y_hat_mel.size(1),
487 |                                        mel.size(2) - y_hat_mel.size(2)),
488 |                                       -11.5129).to(y_hat_mel.device)
489 |                     y_hat_mel = torch.cat((y_hat_mel, zero), dim=2)
490 |                     y_hat_shift_mel = torch.cat((y_hat_shift_mel, zero), dim=2)
491 |                     zero = torch.full((n_sample, yin_hat.size(1),
492 |                                        mel.size(2) - yin_hat.size(2)),
493 |                                       0).to(y_hat_mel.device)
494 |                     yin_hat = torch.cat((yin_hat, zero), dim=2)
495 |                     zero = torch.full((n_sample, ying_hat.size(1),
496 |                                        mel.size(2) - ying_hat.size(2)),
497 |                                       0).to(y_hat_mel.device)
498 |                     ying_hat = torch.cat((ying_hat, zero), dim=2)
499 |                     ying_hat_shift = torch.cat((ying_hat_shift, zero), dim=2)
500 |                     zero = torch.full(
501 |                         (n_sample, z_yin.size(1), mel.size(2) - z_yin.size(2)),
502 |                         0).to(y_hat_mel.device)
503 |                     z_yin = torch.cat((z_yin, zero), dim=2)
504 | 
505 |                     ids = torch.arange(0, mel.size(2)).unsqueeze(0).expand(
506 |                         mel.size(1),
507 |                         -1).unsqueeze(0).expand(n_sample, -1,
508 |                                                 -1).to(y_hat_mel_length.device)
509 |                     mask = ids > y_hat_mel_length.unsqueeze(1).expand(
510 |                         -1, mel.size(1)).unsqueeze(2).expand(
511 |                             -1, -1, mel.size(2))
512 |                     y_hat_mel.masked_fill_(mask, -11.5129)
513 |                     y_hat_shift_mel.masked_fill_(mask, -11.5129)
514 | 
515 |                 image_dict = dict()
516 |                 audio_dict = dict()
517 |                 for i in range(n_sample):
518 |                     image_dict.update({
519 |                         "gen/{}_mel".format(i):
520 |                         utils.plot_spectrogram_to_numpy(
521 |                             y_hat_mel[i].cpu().numpy())
522 |                     })
523 |                     audio_dict.update({
524 |                         "gen/{}_audio".format(i):
525 |                         y_hat[i, :, :y_hat_lengths[i]]
526 |                     })
527 |                     image_dict.update({
528 |                         "gen/{}_mel_shift".format(i):
529 |                         utils.plot_spectrogram_to_numpy(
530 |                             y_hat_shift_mel[i].cpu().numpy())
531 |                     })
532 |                     audio_dict.update({
533 |                         "gen/{}_audio_shift".format(i):
534 |                         y_hat_shift[i, :, :y_hat_lengths[i]]
535 |                     })
536 |                     image_dict.update({
537 |                         "gen/{}_z_yin".format(i):
538 |                         utils.plot_spectrogram_to_numpy(z_yin[i].cpu().numpy())
539 |                     })
540 |                     image_dict.update({
541 |                         "gen/{}_yin_dec".format(i):
542 |                         utils.plot_spectrogram_to_numpy(
543 |                             yin_hat[i].cpu().numpy())
544 |                     })
545 |                     image_dict.update({
546 |                         "gen/{}_ying".format(i):
547 |                         utils.plot_spectrogram_to_numpy(
548 |                             ying_hat[i].cpu().numpy())
549 |                     })
550 |                     image_dict.update({
551 |                         "gen/{}_ying_shift".format(i):
552 |                         utils.plot_spectrogram_to_numpy(
553 |                             ying_hat_shift[i].cpu().numpy())
554 |                     })
555 | 
556 |                 if current_step == 0:
557 |                     for i in range(n_sample):
558 |                         image_dict.update({
559 |                             "gt/{}_mel".format(i):
560 |                             utils.plot_spectrogram_to_numpy(
561 |                                 mel[i].cpu().numpy())
562 |                         })
563 |                         image_dict.update({
564 |                             "gt/{}_ying".format(i):
565 |                             utils.plot_spectrogram_to_numpy(
566 |                                 ying[i].cpu().numpy())
567 |                         })
568 |                         audio_dict.update(
569 |                             {"gt/{}_audio".format(i): y[i, :, :y_lengths[i]]})
570 | 
571 |                 utils.summarize(writer=writer,
572 |                                 global_step=epoch,
573 |                                 images=image_dict,
574 |                                 audios=audio_dict,
575 |                                 audio_sampling_rate=hps.data.sampling_rate)
576 |             val_bar.update(1)
577 |         loss_val_mel = loss_val_mel / len(eval_loader)
578 |         loss_val_yin = loss_val_yin / len(eval_loader)
579 | 
580 |         scalar_dict = {
581 |             "loss/val/mel": loss_val_mel,
582 |             "loss/val/yin": loss_val_yin,
583 |         }
584 |         utils.summarize(writer=writer,
585 |                         global_step=current_step,
586 |                         scalars=scalar_dict)
587 |     generator.train()
588 | 
589 | 
590 | if __name__ == "__main__":
591 | 
592 |     ### add ###
593 |     # model_name = input("Enter your experimental model name...")
594 |     ###########
595 | 
596 |     parser = argparse.ArgumentParser()
597 |     parser.add_argument('-c',
598 |                         '--config',
599 |                         type=str,
600 |                         #default="./configs/default.yaml",
601 |                         default="./configs/config_ja_44100.yaml",       ### add ###
602 |                         help='Path to configuration file')
603 |     parser.add_argument('-m',
604 |                         '--model',
605 |                         type=str,
606 |                         default="test",
607 |                         #required=True,
608 |                         help='Model name')
609 |     parser.add_argument('-r',
610 |                         '--resume',
611 |                         type=str,
612 |                         help='Path to checkpoint for resume')
613 |     parser.add_argument('-f',
614 |                         '--force_resume',
615 |                         type=str,
616 |                         help='Path to checkpoint for force resume')
617 |     parser.add_argument('-t',
618 |                         '--transfer',
619 |                         type=str,
620 |                         help='Path to baseline checkpoint for transfer')
621 |     parser.add_argument('-w',
622 |                         '--ignore_warning',
623 |                         default=True,
624 |                         action="store_true",
625 |                         help='Ignore warning message')
626 |     parser.add_argument('-i',
627 |                         '--initial_run',
628 |                         action="store_true",
629 |                         help='Inintial run for saving pt files')
630 |     args = parser.parse_args()
631 |     if args.ignore_warning:
632 |         import warnings
633 |         warnings.filterwarnings(action='ignore')
634 | 
635 |     main(args)
636 | 


--------------------------------------------------------------------------------
/transforms.py:
--------------------------------------------------------------------------------
  1 | # from https://github.com/jaywalnut310/vits
  2 | import numpy as np
  3 | import torch
  4 | from torch.nn import functional as F
  5 | 
  6 | 
  7 | DEFAULT_MIN_BIN_WIDTH = 1e-3
  8 | DEFAULT_MIN_BIN_HEIGHT = 1e-3
  9 | DEFAULT_MIN_DERIVATIVE = 1e-3
 10 | 
 11 | 
 12 | def piecewise_rational_quadratic_transform(
 13 |     inputs, 
 14 |     unnormalized_widths,
 15 |     unnormalized_heights,
 16 |     unnormalized_derivatives,
 17 |     inverse=False,
 18 |     tails=None, 
 19 |     tail_bound=1.,
 20 |     min_bin_width=DEFAULT_MIN_BIN_WIDTH,
 21 |     min_bin_height=DEFAULT_MIN_BIN_HEIGHT,
 22 |     min_derivative=DEFAULT_MIN_DERIVATIVE
 23 | ):
 24 | 
 25 |     if tails is None:
 26 |         spline_fn = rational_quadratic_spline
 27 |         spline_kwargs = {}
 28 |     else:
 29 |         spline_fn = unconstrained_rational_quadratic_spline
 30 |         spline_kwargs = {
 31 |             'tails': tails,
 32 |             'tail_bound': tail_bound
 33 |         }
 34 | 
 35 |     outputs, logabsdet = spline_fn(
 36 |         inputs=inputs,
 37 |         unnormalized_widths=unnormalized_widths,
 38 |         unnormalized_heights=unnormalized_heights,
 39 |         unnormalized_derivatives=unnormalized_derivatives,
 40 |         inverse=inverse,
 41 |         min_bin_width=min_bin_width,
 42 |         min_bin_height=min_bin_height,
 43 |         min_derivative=min_derivative,
 44 |         **spline_kwargs
 45 |     )
 46 |     return outputs, logabsdet
 47 | 
 48 | 
 49 | def searchsorted(bin_locations, inputs, eps=1e-6):
 50 |     bin_locations[..., -1] += eps
 51 |     return torch.sum(
 52 |         inputs[..., None] >= bin_locations,
 53 |         dim=-1
 54 |     ) - 1
 55 | 
 56 | 
 57 | def unconstrained_rational_quadratic_spline(
 58 |     inputs,
 59 |     unnormalized_widths,
 60 |     unnormalized_heights,
 61 |     unnormalized_derivatives,
 62 |     inverse=False,
 63 |     tails='linear',
 64 |     tail_bound=1.,
 65 |     min_bin_width=DEFAULT_MIN_BIN_WIDTH,
 66 |     min_bin_height=DEFAULT_MIN_BIN_HEIGHT,
 67 |     min_derivative=DEFAULT_MIN_DERIVATIVE
 68 | ):
 69 |     inside_interval_mask = (inputs >= -tail_bound) & (inputs <= tail_bound)
 70 |     outside_interval_mask = ~inside_interval_mask
 71 | 
 72 |     outputs = torch.zeros_like(inputs)
 73 |     logabsdet = torch.zeros_like(inputs)
 74 | 
 75 |     if tails == 'linear':
 76 |         unnormalized_derivatives = F.pad(unnormalized_derivatives, pad=(1, 1))
 77 |         constant = np.log(np.exp(1 - min_derivative) - 1)
 78 |         unnormalized_derivatives[..., 0] = constant
 79 |         unnormalized_derivatives[..., -1] = constant
 80 | 
 81 |         outputs[outside_interval_mask] = inputs[outside_interval_mask]
 82 |         logabsdet[outside_interval_mask] = 0
 83 |     else:
 84 |         raise RuntimeError('{} tails are not implemented.'.format(tails))
 85 | 
 86 |     outputs[inside_interval_mask], logabsdet[inside_interval_mask] = rational_quadratic_spline(
 87 |         inputs=inputs[inside_interval_mask],
 88 |         unnormalized_widths=unnormalized_widths[inside_interval_mask, :],
 89 |         unnormalized_heights=unnormalized_heights[inside_interval_mask, :],
 90 |         unnormalized_derivatives=unnormalized_derivatives[inside_interval_mask, :],
 91 |         inverse=inverse,
 92 |         left=-tail_bound, right=tail_bound, bottom=-tail_bound, top=tail_bound,
 93 |         min_bin_width=min_bin_width,
 94 |         min_bin_height=min_bin_height,
 95 |         min_derivative=min_derivative
 96 |     )
 97 | 
 98 |     return outputs, logabsdet
 99 | 
100 | def rational_quadratic_spline(
101 |     inputs,
102 |     unnormalized_widths,
103 |     unnormalized_heights,
104 |     unnormalized_derivatives,
105 |     inverse=False,
106 |     left=0., right=1., bottom=0., top=1.,
107 |     min_bin_width=DEFAULT_MIN_BIN_WIDTH,
108 |     min_bin_height=DEFAULT_MIN_BIN_HEIGHT,
109 |     min_derivative=DEFAULT_MIN_DERIVATIVE
110 | ):
111 |     if torch.min(inputs) < left or torch.max(inputs) > right:
112 |         raise ValueError('Input to a transform is not within its domain')
113 | 
114 |     num_bins = unnormalized_widths.shape[-1]
115 | 
116 |     if min_bin_width * num_bins > 1.0:
117 |         raise ValueError('Minimal bin width too large for the number of bins')
118 |     if min_bin_height * num_bins > 1.0:
119 |         raise ValueError('Minimal bin height too large for the number of bins')
120 | 
121 |     widths = F.softmax(unnormalized_widths, dim=-1)
122 |     widths = min_bin_width + (1 - min_bin_width * num_bins) * widths
123 |     cumwidths = torch.cumsum(widths, dim=-1)
124 |     cumwidths = F.pad(cumwidths, pad=(1, 0), mode='constant', value=0.0)
125 |     cumwidths = (right - left) * cumwidths + left
126 |     cumwidths[..., 0] = left
127 |     cumwidths[..., -1] = right
128 |     widths = cumwidths[..., 1:] - cumwidths[..., :-1]
129 | 
130 |     derivatives = min_derivative + F.softplus(unnormalized_derivatives)
131 | 
132 |     heights = F.softmax(unnormalized_heights, dim=-1)
133 |     heights = min_bin_height + (1 - min_bin_height * num_bins) * heights
134 |     cumheights = torch.cumsum(heights, dim=-1)
135 |     cumheights = F.pad(cumheights, pad=(1, 0), mode='constant', value=0.0)
136 |     cumheights = (top - bottom) * cumheights + bottom
137 |     cumheights[..., 0] = bottom
138 |     cumheights[..., -1] = top
139 |     heights = cumheights[..., 1:] - cumheights[..., :-1]
140 | 
141 |     if inverse:
142 |         bin_idx = searchsorted(cumheights, inputs)[..., None]
143 |     else:
144 |         bin_idx = searchsorted(cumwidths, inputs)[..., None]
145 | 
146 |     input_cumwidths = cumwidths.gather(-1, bin_idx)[..., 0]
147 |     input_bin_widths = widths.gather(-1, bin_idx)[..., 0]
148 | 
149 |     input_cumheights = cumheights.gather(-1, bin_idx)[..., 0]
150 |     delta = heights / widths
151 |     input_delta = delta.gather(-1, bin_idx)[..., 0]
152 | 
153 |     input_derivatives = derivatives.gather(-1, bin_idx)[..., 0]
154 |     input_derivatives_plus_one = derivatives[..., 1:].gather(-1, bin_idx)[..., 0]
155 | 
156 |     input_heights = heights.gather(-1, bin_idx)[..., 0]
157 | 
158 |     if inverse:
159 |         a = (((inputs - input_cumheights) * (input_derivatives
160 |                                              + input_derivatives_plus_one
161 |                                              - 2 * input_delta)
162 |               + input_heights * (input_delta - input_derivatives)))
163 |         b = (input_heights * input_derivatives
164 |              - (inputs - input_cumheights) * (input_derivatives
165 |                                               + input_derivatives_plus_one
166 |                                               - 2 * input_delta))
167 |         c = - input_delta * (inputs - input_cumheights)
168 | 
169 |         discriminant = b.pow(2) - 4 * a * c
170 |         assert (discriminant >= 0).all()
171 | 
172 |         root = (2 * c) / (-b - torch.sqrt(discriminant))
173 |         outputs = root * input_bin_widths + input_cumwidths
174 | 
175 |         theta_one_minus_theta = root * (1 - root)
176 |         denominator = input_delta + ((input_derivatives + input_derivatives_plus_one - 2 * input_delta)
177 |                                      * theta_one_minus_theta)
178 |         derivative_numerator = input_delta.pow(2) * (input_derivatives_plus_one * root.pow(2)
179 |                                                      + 2 * input_delta * theta_one_minus_theta
180 |                                                      + input_derivatives * (1 - root).pow(2))
181 |         logabsdet = torch.log(derivative_numerator) - 2 * torch.log(denominator)
182 | 
183 |         return outputs, -logabsdet
184 |     else:
185 |         theta = (inputs - input_cumwidths) / input_bin_widths
186 |         theta_one_minus_theta = theta * (1 - theta)
187 | 
188 |         numerator = input_heights * (input_delta * theta.pow(2)
189 |                                      + input_derivatives * theta_one_minus_theta)
190 |         denominator = input_delta + ((input_derivatives + input_derivatives_plus_one - 2 * input_delta)
191 |                                      * theta_one_minus_theta)
192 |         outputs = input_cumheights + numerator / denominator
193 | 
194 |         derivative_numerator = input_delta.pow(2) * (input_derivatives_plus_one * theta.pow(2)
195 |                                                      + 2 * input_delta * theta_one_minus_theta
196 |                                                      + input_derivatives * (1 - theta).pow(2))
197 |         logabsdet = torch.log(derivative_numerator) - 2 * torch.log(denominator)
198 | 
199 |         return outputs, logabsdet
200 | 


--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
  1 | # from https://github.com/jaywalnut310/vits
  2 | import os
  3 | import sys
  4 | import logging
  5 | import subprocess
  6 | import torch
  7 | import numpy as np
  8 | from omegaconf import OmegaConf
  9 | from scipy.io.wavfile import read
 10 | 
 11 | MATPLOTLIB_FLAG = False
 12 | 
 13 | logging.basicConfig(
 14 |     stream=sys.stdout, 
 15 |     level=logging.INFO, 
 16 |     format='[%(levelname)s|%(filename)s:%(lineno)s][%(asctime)s] >>> %(message)s'
 17 | )
 18 | logger = logging
 19 | 
 20 | 
 21 | def load_checkpoint(checkpoint_path, rank=0, model_g=None, model_d=None, optim_g=None, optim_d=None):
 22 |     assert os.path.isfile(checkpoint_path)
 23 |     checkpoint_dict = torch.load(checkpoint_path, map_location='cpu')
 24 |     iteration = checkpoint_dict['iteration']
 25 |     learning_rate = checkpoint_dict['learning_rate']
 26 |     config = checkpoint_dict['config']
 27 | 
 28 |     if model_g is not None:
 29 |         model_g, optim_g = load_model(
 30 |             model_g, 
 31 |             checkpoint_dict['model_g'],
 32 |             optim_g,
 33 |             checkpoint_dict['optimizer_g'])
 34 | 
 35 |     if model_d is not None:
 36 |         model_d, optim_d = load_model(
 37 |             model_d, 
 38 |             checkpoint_dict['model_d'],
 39 |             optim_d,
 40 |             checkpoint_dict['optimizer_d'])
 41 |     if rank == 0:
 42 |         logger.info(
 43 |             "Loaded checkpoint '{}' (iteration {})".format(
 44 |                 checkpoint_path, 
 45 |                 iteration
 46 |             )
 47 |         )
 48 |     return model_g, model_d, optim_g, optim_d, learning_rate, iteration, config
 49 |     
 50 | def load_checkpoint_diffsize(checkpoint_path, rank=0, model_g=None, model_d=None):
 51 |     assert os.path.isfile(checkpoint_path)
 52 |     checkpoint_dict = torch.load(checkpoint_path, map_location='cpu')
 53 |     iteration = checkpoint_dict['iteration']
 54 |     learning_rate = checkpoint_dict['learning_rate']
 55 |     config = checkpoint_dict['config']
 56 | 
 57 |     if model_g is not None:
 58 |         model_g = load_model_diffsize(
 59 |             model_g, 
 60 |             checkpoint_dict['model_g'])
 61 |     if model_d is not None:
 62 |         model_d = load_model_diffsize(
 63 |             model_d, 
 64 |             checkpoint_dict['model_d'])
 65 |     if rank == 0:
 66 |         logger.info(
 67 |             "Loaded checkpoint '{}' (iteration {})".format(
 68 |                 checkpoint_path, 
 69 |                 iteration
 70 |             )
 71 |         )
 72 |     del checkpoint_dict
 73 |     return model_g, model_d, learning_rate, iteration, config
 74 |  
 75 | def load_model_diffsize(model, model_state_dict):
 76 |     if hasattr(model, 'module'):
 77 |         state_dict = model.module.state_dict()
 78 |     else:
 79 |         state_dict = model.state_dict()
 80 | 
 81 |     for k, v in model_state_dict.items():
 82 |         if k in state_dict and state_dict[k].size() == v.size():
 83 |             state_dict[k] = v
 84 |     
 85 |     if hasattr(model, 'module'):
 86 |         model.module.load_state_dict(state_dict, strict=False)
 87 |     else:
 88 |         model.load_state_dict(state_dict, strict=False)
 89 |         
 90 |     return model
 91 | 
 92 | 
 93 | 
 94 | def load_model(model, model_state_dict, optim, optim_state_dict):
 95 |     if optim is not None:
 96 |         optim.load_state_dict(optim_state_dict)
 97 | 
 98 |     if hasattr(model, 'module'):
 99 |         state_dict = model.module.state_dict()
100 |     else:
101 |         state_dict = model.state_dict()
102 | 
103 |     for k, v in model_state_dict.items():
104 |         if k in state_dict and state_dict[k].size() == v.size():
105 |             state_dict[k] = v
106 |     
107 |     if hasattr(model, 'module'):
108 |         model.module.load_state_dict(state_dict)
109 |     else:
110 |         model.load_state_dict(state_dict)
111 |         
112 |     return model, optim
113 | 
114 | 
115 | def save_checkpoint(net_g, optim_g, net_d, optim_d, hps, epoch, learning_rate, save_path):
116 |     
117 |     def get_state_dict(model):
118 |         if hasattr(model, 'module'):
119 |             state_dict = model.module.state_dict()
120 |         else:
121 |             state_dict = model.state_dict()
122 |         return state_dict
123 |     
124 |     torch.save({'model_g': get_state_dict(net_g),
125 |                 'model_d': get_state_dict(net_d),
126 |                 'optimizer_g': optim_g.state_dict(),
127 |                 'optimizer_d': optim_d.state_dict(),
128 |                 'config': str(hps),
129 |                 'iteration': epoch,
130 |                 'learning_rate': learning_rate}, save_path)
131 | 
132 | 
133 | def summarize(writer, global_step, scalars={}, histograms={}, images={}, audios={}, audio_sampling_rate=22050):
134 |     for k, v in scalars.items():
135 |         writer.add_scalar(k, v, global_step)
136 |     for k, v in histograms.items():
137 |         writer.add_histogram(k, v, global_step)
138 |     for k, v in images.items():
139 |         writer.add_image(k, v, global_step, dataformats='HWC')
140 |     for k, v in audios.items():
141 |         writer.add_audio(k, v, global_step, audio_sampling_rate)
142 | 
143 | 
144 | def plot_spectrogram_to_numpy(spectrogram):
145 |     global MATPLOTLIB_FLAG
146 |     if not MATPLOTLIB_FLAG:
147 |         import matplotlib
148 |         matplotlib.use("Agg")
149 |         MATPLOTLIB_FLAG = True
150 |         mpl_logger = logging.getLogger('matplotlib')
151 |         mpl_logger.setLevel(logging.WARNING)
152 |     import matplotlib.pylab as plt
153 |     import numpy as np
154 | 
155 |     fig, ax = plt.subplots(figsize=(10, 2))
156 |     im = ax.imshow(spectrogram, aspect="auto", origin="lower",
157 |                    interpolation='none')
158 |     plt.colorbar(im, ax=ax)
159 |     plt.xlabel("Frames")
160 |     plt.ylabel("Channels")
161 |     plt.tight_layout()
162 | 
163 |     fig.canvas.draw()
164 |     data = np.fromstring(fig.canvas.tostring_rgb(), dtype=np.uint8, sep='')
165 |     data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,))
166 |     plt.close()
167 |     return data
168 | 
169 | 
170 | def plot_alignment_to_numpy(alignment, info=None):
171 |     global MATPLOTLIB_FLAG
172 |     if not MATPLOTLIB_FLAG:
173 |         import matplotlib
174 |         matplotlib.use("Agg")
175 |         MATPLOTLIB_FLAG = True
176 |         mpl_logger = logging.getLogger('matplotlib')
177 |         mpl_logger.setLevel(logging.WARNING)
178 |     import matplotlib.pylab as plt
179 |     import numpy as np
180 | 
181 |     fig, ax = plt.subplots(figsize=(6, 4))
182 |     im = ax.imshow(alignment.transpose(), aspect='auto', origin='lower',
183 |                    interpolation='none')
184 |     fig.colorbar(im, ax=ax)
185 |     xlabel = 'Decoder timestep'
186 |     if info is not None:
187 |         xlabel += '\n\n' + info
188 |     plt.xlabel(xlabel)
189 |     plt.ylabel('Encoder timestep')
190 |     plt.tight_layout()
191 | 
192 |     fig.canvas.draw()
193 |     data = np.fromstring(fig.canvas.tostring_rgb(), dtype=np.uint8, sep='')
194 |     data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,))
195 |     plt.close()
196 |     return data
197 | 
198 | 
199 | def load_wav_to_torch(full_path):
200 |     sampling_rate, wav = read(full_path.replace("\\", "/")) ### modify .replace("\\", "/") ###
201 |     
202 |     if len(wav.shape) == 2:
203 |         wav = wav[:, 0]
204 | 
205 |     if wav.dtype == np.int16:
206 |         wav = wav / 32768.0
207 |     elif wav.dtype == np.int32:
208 |         wav = wav / 2147483648.0
209 |     elif wav.dtype == np.uint8:
210 |         wav = (wav - 128) / 128.0
211 |     wav = wav.astype(np.float32)
212 |     return torch.FloatTensor(wav), sampling_rate
213 | 
214 | 
215 | def load_filepaths_and_text(filename, split="|"):
216 |     with open(filename, encoding='utf-8') as f:
217 |         filepaths_and_text = [line.strip().split(split) for line in f]
218 |     return filepaths_and_text
219 | 
220 | 
221 | def get_hparams(args, init=True):
222 |     config = OmegaConf.load(args.config)
223 |     hparams = HParams(**config)
224 |     model_dir = "." + os.path.join(hparams.train.log_path, args.model)  ### add "." +  ###
225 | 
226 |     if not os.path.exists(model_dir):
227 |         os.makedirs(model_dir)
228 |     hparams.model_name = args.model
229 |     hparams.model_dir = model_dir
230 |     config_save_path = os.path.join(model_dir, "config.yaml")
231 | 
232 |     if init:
233 |         OmegaConf.save(config, config_save_path)
234 | 
235 |     return hparams
236 | 
237 | 
238 | def get_hparams_from_file(config_path):
239 |     config = OmegaConf.load(config_path)
240 |     hparams = HParams(**config)
241 |     return hparams
242 | 
243 | 
244 | def check_git_hash(model_dir):
245 |     source_dir = os.path.dirname(os.path.realpath(__file__))
246 |     if not os.path.exists(os.path.join(source_dir, ".git")):
247 |         logger.warn("{} is not a git repository, therefore hash value comparison will be ignored.".format(
248 |             source_dir
249 |         ))
250 |         return
251 | 
252 |     cur_hash = subprocess.getoutput("git rev-parse HEAD")
253 | 
254 |     path = os.path.join(model_dir, "githash")
255 |     if os.path.exists(path):
256 |         saved_hash = open(path).read()
257 |         if saved_hash != cur_hash:
258 |             logger.warn("git hash values are different. {}(saved) != {}(current)".format(
259 |                 saved_hash[:8], cur_hash[:8]))
260 |     else:
261 |         open(path, "w").write(cur_hash)
262 | 
263 | 
264 | def get_logger(model_dir, filename="train.log"):
265 |     global logger
266 |     logger = logging.getLogger(os.path.basename(model_dir))
267 |     logger.setLevel(logging.DEBUG)
268 | 
269 |     formatter = logging.Formatter(
270 |         "%(asctime)s\t%(name)s\t%(levelname)s\t%(message)s")
271 |     if not os.path.exists(model_dir):
272 |         os.makedirs(model_dir)
273 |     h = logging.FileHandler(os.path.join(model_dir, filename))
274 |     h.setLevel(logging.DEBUG)
275 |     h.setFormatter(formatter)
276 |     logger.addHandler(h)
277 |     return logger
278 | 
279 | 
280 | class HParams():
281 |     def __init__(self, **kwargs):
282 |         for k, v in kwargs.items():
283 |             if type(v) == dict:
284 |                 v = HParams(**v)
285 |             self[k] = v
286 | 
287 |     def keys(self):
288 |         return self.__dict__.keys()
289 | 
290 |     def items(self):
291 |         return self.__dict__.items()
292 | 
293 |     def values(self):
294 |         return self.__dict__.values()
295 | 
296 |     def __len__(self):
297 |         return len(self.__dict__)
298 | 
299 |     def __getitem__(self, key):
300 |         return getattr(self, key)
301 | 
302 |     def __setitem__(self, key, value):
303 |         return setattr(self, key, value)
304 | 
305 |     def __contains__(self, key):
306 |         return key in self.__dict__
307 | 
308 |     def __repr__(self):
309 |         return self.__dict__.__repr__()
310 | 


--------------------------------------------------------------------------------
/yin.py:
--------------------------------------------------------------------------------
  1 | # remove np from https://github.com/dhchoi99/NANSY/blob/master/models/yin.py
  2 | # adapted from https://github.com/patriceguyot/Yin
  3 | # https://github.com/NVIDIA/mellotron/blob/master/yin.py
  4 | 
  5 | import torch
  6 | import torch.nn.functional as F
  7 | from math import log2, ceil
  8 | 
  9 | 
 10 | def differenceFunction(x, N, tau_max):
 11 |     """
 12 |     Compute difference function of data x. This corresponds to equation (6) in [1]
 13 |     This solution is implemented directly with torch rfft.
 14 | 
 15 | 
 16 |     :param x: audio data (Tensor)
 17 |     :param N: length of data
 18 |     :param tau_max: integration window size
 19 |     :return: difference function
 20 |     :rtype: list
 21 |     """
 22 | 
 23 |     #x = np.array(x, np.float64) #[B,T]
 24 |     assert x.dim() == 2
 25 |     b, w = x.shape
 26 |     if w < tau_max:
 27 |         x = F.pad(x, (tau_max - w - (tau_max - w) // 2, (tau_max - w) // 2),
 28 |                   'constant',
 29 |                   mode='reflect')
 30 |     w = tau_max
 31 |     #x_cumsum = np.concatenate((np.array([0.]), (x * x).cumsum()))
 32 |     x_cumsum = torch.cat(
 33 |         [torch.zeros([b, 1], device=x.device), (x * x).cumsum(dim=1)], dim=1)
 34 |     size = w + tau_max
 35 |     p2 = (size // 32).bit_length()
 36 |     #p2 = ceil(log2(size+1 // 32))
 37 |     nice_numbers = (16, 18, 20, 24, 25, 27, 30, 32)
 38 |     size_pad = min(n * 2**p2 for n in nice_numbers if n * 2**p2 >= size)
 39 |     fc = torch.fft.rfft(x, size_pad)  #[B,F]
 40 |     conv = torch.fft.irfft(fc * fc.conj())[:, :tau_max]
 41 |     return x_cumsum[:, w:w - tau_max:
 42 |                     -1] + x_cumsum[:, w] - x_cumsum[:, :tau_max] - 2 * conv
 43 | 
 44 | 
 45 | def differenceFunction_np(x, N, tau_max):
 46 |     """
 47 |     Compute difference function of data x. This corresponds to equation (6) in [1]
 48 |     This solution is implemented directly with Numpy fft.
 49 | 
 50 | 
 51 |     :param x: audio data
 52 |     :param N: length of data
 53 |     :param tau_max: integration window size
 54 |     :return: difference function
 55 |     :rtype: list
 56 |     """
 57 | 
 58 |     x = np.array(x, np.float64)
 59 |     w = x.size
 60 |     tau_max = min(tau_max, w)
 61 |     x_cumsum = np.concatenate((np.array([0.]), (x * x).cumsum()))
 62 |     size = w + tau_max
 63 |     p2 = (size // 32).bit_length()
 64 |     nice_numbers = (16, 18, 20, 24, 25, 27, 30, 32)
 65 |     size_pad = min(x * 2**p2 for x in nice_numbers if x * 2**p2 >= size)
 66 |     fc = np.fft.rfft(x, size_pad)
 67 |     conv = np.fft.irfft(fc * fc.conjugate())[:tau_max]
 68 |     return x_cumsum[w:w -
 69 |                     tau_max:-1] + x_cumsum[w] - x_cumsum[:tau_max] - 2 * conv
 70 | 
 71 | 
 72 | def cumulativeMeanNormalizedDifferenceFunction(df, N, eps=1e-8):
 73 |     """
 74 |     Compute cumulative mean normalized difference function (CMND).
 75 | 
 76 |     This corresponds to equation (8) in [1]
 77 | 
 78 |     :param df: Difference function
 79 |     :param N: length of data
 80 |     :return: cumulative mean normalized difference function
 81 |     :rtype: list
 82 |     """
 83 |     #np.seterr(divide='ignore', invalid='ignore')
 84 |     # scipy method, assert df>0 for all element
 85 |     #   cmndf = df[1:] * np.asarray(list(range(1, N))) / (np.cumsum(df[1:]).astype(float) + eps)
 86 |     B, _ = df.shape
 87 |     cmndf = df[:,
 88 |                1:] * torch.arange(1, N, device=df.device, dtype=df.dtype).view(
 89 |                    1, -1) / (df[:, 1:].cumsum(dim=-1) + eps)
 90 |     return torch.cat(
 91 |         [torch.ones([B, 1], device=df.device, dtype=df.dtype), cmndf], dim=-1)
 92 | 
 93 | 
 94 | def differenceFunctionTorch(xs: torch.Tensor, N, tau_max) -> torch.Tensor:
 95 |     """pytorch backend batch-wise differenceFunction
 96 |     has 1e-4 level error with input shape of (32, 22050*1.5)
 97 |     Args:
 98 |         xs:
 99 |         N:
100 |         tau_max:
101 | 
102 |     Returns:
103 | 
104 |     """
105 |     xs = xs.double()
106 |     w = xs.shape[-1]
107 |     tau_max = min(tau_max, w)
108 |     zeros = torch.zeros((xs.shape[0], 1))
109 |     x_cumsum = torch.cat((torch.zeros((xs.shape[0], 1), device=xs.device),
110 |                           (xs * xs).cumsum(dim=-1, dtype=torch.double)),
111 |                          dim=-1)  # B x w
112 |     size = w + tau_max
113 |     p2 = (size // 32).bit_length()
114 |     nice_numbers = (16, 18, 20, 24, 25, 27, 30, 32)
115 |     size_pad = min(x * 2**p2 for x in nice_numbers if x * 2**p2 >= size)
116 | 
117 |     fcs = torch.fft.rfft(xs, n=size_pad, dim=-1)
118 |     convs = torch.fft.irfft(fcs * fcs.conj())[:, :tau_max]
119 |     y1 = torch.flip(x_cumsum[:, w - tau_max + 1:w + 1], dims=[-1])
120 |     y = y1 + x_cumsum[:, w].unsqueeze(-1) - x_cumsum[:, :tau_max] - 2 * convs
121 |     return y
122 | 
123 | 
124 | def cumulativeMeanNormalizedDifferenceFunctionTorch(dfs: torch.Tensor,
125 |                                                     N,
126 |                                                     eps=1e-8) -> torch.Tensor:
127 |     arange = torch.arange(1, N, device=dfs.device, dtype=torch.float64)
128 |     cumsum = torch.cumsum(dfs[:, 1:], dim=-1,
129 |                           dtype=torch.float64).to(dfs.device)
130 | 
131 |     cmndfs = dfs[:, 1:] * arange / (cumsum + eps)
132 |     cmndfs = torch.cat(
133 |         (torch.ones(cmndfs.shape[0], 1, device=dfs.device), cmndfs), dim=-1)
134 |     return cmndfs
135 | 
136 | 
137 | if __name__ == '__main__':
138 |     wav = torch.randn(32, int(22050 * 1.5)).cuda()
139 |     wav_numpy = wav.detach().cpu().numpy()
140 |     x = wav_numpy[0]
141 | 
142 |     w_len = 2048
143 |     w_step = 256
144 |     tau_max = 2048
145 |     W = 2048
146 | 
147 |     startFrames = list(range(0, x.shape[-1] - w_len, w_step))
148 |     startFrames = np.asarray(startFrames)
149 |     # times = startFrames / sr
150 |     frames = [x[..., t:t + W] for t in startFrames]
151 |     frames = np.asarray(frames)
152 |     frames_torch = torch.from_numpy(frames).cuda()
153 | 
154 |     cmndfs0 = []
155 |     for idx, frame in enumerate(frames):
156 |         df = differenceFunction(frame, frame.shape[-1], tau_max)
157 |         cmndf = cumulativeMeanNormalizedDifferenceFunction(df, tau_max)
158 |         cmndfs0.append(cmndf)
159 |     cmndfs0 = np.asarray(cmndfs0)
160 | 
161 |     dfs = differenceFunctionTorch(frames_torch, frames_torch.shape[-1],
162 |                                   tau_max)
163 |     cmndfs1 = cumulativeMeanNormalizedDifferenceFunctionTorch(
164 |         dfs, tau_max).detach().cpu().numpy()
165 |     print(cmndfs0.shape, cmndfs1.shape)
166 |     print(np.sum(np.abs(cmndfs0 - cmndfs1)))
167 | 


--------------------------------------------------------------------------------