├── .gitignore
├── LICENSE
├── README.md
├── _config.yml
├── attentions.py
├── commons.py
├── configs
│   ├── requirements.txt
│   └── singing_base.json
├── data_utils.py
├── evaluate
│   ├── evaluate_f0.py
│   ├── evaluate_mcd.py
│   ├── evaluate_semitone.py
│   └── evaluate_vuv.py
├── evaluate_score.sh
├── filelists
│   ├── singing_test.txt
│   ├── singing_train.txt
│   └── singing_valid.txt
├── losses.py
├── mel_processing.py
├── models.py
├── modules.py
├── normalize_wav.py
├── plot_f0.py
├── prepare
│   ├── __init__.py
│   ├── align_wav_spec.py
│   ├── data_vits.py
│   ├── data_vits_phn.py
│   ├── data_vits_phn_ofuton.py
│   ├── dur_to_frame.py
│   ├── gen_ofuton_transcript.py
│   ├── midi-HZ.scp
│   ├── midi-note.scp
│   ├── phone_map.py
│   ├── phone_uv.py
│   ├── preprocess.py
│   ├── preprocess_jp.py
│   ├── resample_wav.py
│   └── resample_wav.sh
├── resource
│   ├── 2005000151.wav
│   ├── 2005000152.wav
│   ├── 2006000186.wav
│   ├── 2006000187.wav
│   ├── 2008000268.wav
│   ├── vising_loss.png
│   └── vising_mel.png
├── train.py
├── train.sh
├── transforms.py
├── utils.py
├── vsinging_debug.py
├── vsinging_infer.py
├── vsinging_infer.txt
├── vsinging_infer_jp.py
├── vsinging_infer_jp.txt
├── vsinging_song.py
└── vsinging_song_midi.txt

/.gitignore:
--------------------------------------------------------------------------------
1 | *.pth
2 | *.pyc
3 | filelists/singing_train.txt
4 | filelists/singing_valid.txt
5 | filelists/vits_file.txt
6 | logs
7 | singing_out
8 | */*_res
9 | *.zip
10 | nohup.out
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution.
You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE.
You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "[]"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright [yyyy] [name of copyright owner]
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Init
2 | An unofficial implementation of VISinger
3 |
4 | # Reference Repos
5 | https://github.com/jaywalnut310/vits
6 |
7 | https://github.com/MoonInTheRiver/DiffSinger
8 |
9 | https://wenet.org.cn/opencpop/
10 |
11 | https://github.com/PlayVoice/VI-SVS
12 |
13 | # Data Preprocess
14 | ```bash
15 | export PYTHONPATH=.
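# Note: the scripts under prepare/ import top-level modules of this repo
# (e.g. utils, mel_processing), so they are meant to be run from the repo
# root with PYTHONPATH set as above.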
16 | ``` 17 | 18 | Generate ../VISinger_data/label_vits_phn/XXX._label.npy|XXX._label_dur.npy|XXX_score.npy|XXX_score_dur.npy|XXX_pitch.npy|XXX_slurs.npy 19 | 20 | ```bash 21 | python prepare/data_vits_phn.py 22 | ``` 23 | 24 | Generate filelists/vits_file.txt 25 | Format: wave path|label path|label duration path|score path|score duration path|pitch path|slurs path; 26 | 27 | ```bash 28 | python prepare/preprocess.py 29 | ``` 30 | 31 | # VISinger training 32 | 33 | ```bash 34 | python train.py -c configs/singing_base.json -m singing_base 35 | ``` 36 | 37 | or 38 | 39 | ```bash 40 | ./train.sh 41 | ``` 42 | 43 | # Inference 44 | 45 | ```bash 46 | ./evaluate_score.sh 47 | ``` 48 | 49 | ![LOSS](/resource/vising_loss.png) 50 | ![MEL](/resource/vising_mel.png) 51 | 52 | # Samples 53 | 54 | 57 | 58 | 61 | 62 | 65 | 66 | 69 | 70 | 73 | 74 | 75 | 76 | 77 | 78 | -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | remote_theme: pages-themes/cayman@v0.2.0 2 | plugins: 3 | - jekyll-remote-theme # add this line to the plugins list if you already have one -------------------------------------------------------------------------------- /attentions.py: -------------------------------------------------------------------------------- 1 | import copy 2 | import math 3 | import numpy as np 4 | import torch 5 | from torch import nn 6 | from torch.nn import functional as F 7 | 8 | import commons 9 | import modules 10 | from modules import LayerNorm 11 | 12 | 13 | class Encoder(nn.Module): 14 | def __init__( 15 | self, 16 | hidden_channels, 17 | filter_channels, 18 | n_heads, 19 | n_layers, 20 | kernel_size=1, 21 | p_dropout=0.0, 22 | window_size=10, 23 | **kwargs 24 | ): 25 | super().__init__() 26 | self.hidden_channels = hidden_channels 27 | self.filter_channels = filter_channels 28 | self.n_heads = n_heads 29 | self.n_layers = n_layers 30 | self.kernel_size = kernel_size 31 | self.p_dropout = p_dropout 32 | self.window_size = window_size 33 | 34 | self.drop = nn.Dropout(p_dropout) 35 | self.attn_layers = nn.ModuleList() 36 | self.norm_layers_1 = nn.ModuleList() 37 | self.ffn_layers = nn.ModuleList() 38 | self.norm_layers_2 = nn.ModuleList() 39 | for i in range(self.n_layers): 40 | self.attn_layers.append( 41 | MultiHeadAttention( 42 | hidden_channels, 43 | hidden_channels, 44 | n_heads, 45 | p_dropout=p_dropout, 46 | window_size=window_size, 47 | ) 48 | ) 49 | self.norm_layers_1.append(LayerNorm(hidden_channels)) 50 | self.ffn_layers.append( 51 | FFN( 52 | hidden_channels, 53 | hidden_channels, 54 | filter_channels, 55 | kernel_size, 56 | p_dropout=p_dropout, 57 | ) 58 | ) 59 | self.norm_layers_2.append(LayerNorm(hidden_channels)) 60 | 61 | def forward(self, x, x_mask): 62 | attn_mask = x_mask.unsqueeze(2) * x_mask.unsqueeze(-1) 63 | x = x * x_mask 64 | for i in range(self.n_layers): 65 | y = self.attn_layers[i](x, x, attn_mask) 66 | y = self.drop(y) 67 | x = self.norm_layers_1[i](x + y) 68 | 69 | y = self.ffn_layers[i](x, x_mask) 70 | y = self.drop(y) 71 | x = self.norm_layers_2[i](x + y) 72 | x = x * x_mask 73 | return x 74 | 75 | 76 | class Decoder(nn.Module): 77 | def __init__( 78 | self, 79 | hidden_channels, 80 | filter_channels, 81 | n_heads, 82 | n_layers, 83 | kernel_size=1, 84 | p_dropout=0.0, 85 | proximal_bias=False, 86 | proximal_init=True, 87 | **kwargs 88 | ): 89 | super().__init__() 90 | self.hidden_channels = hidden_channels 91 | self.filter_channels = 
filter_channels 92 | self.n_heads = n_heads 93 | self.n_layers = n_layers 94 | self.kernel_size = kernel_size 95 | self.p_dropout = p_dropout 96 | self.proximal_bias = proximal_bias 97 | self.proximal_init = proximal_init 98 | 99 | self.drop = nn.Dropout(p_dropout) 100 | self.self_attn_layers = nn.ModuleList() 101 | self.norm_layers_0 = nn.ModuleList() 102 | self.encdec_attn_layers = nn.ModuleList() 103 | self.norm_layers_1 = nn.ModuleList() 104 | self.ffn_layers = nn.ModuleList() 105 | self.norm_layers_2 = nn.ModuleList() 106 | for i in range(self.n_layers): 107 | self.self_attn_layers.append( 108 | MultiHeadAttention( 109 | hidden_channels, 110 | hidden_channels, 111 | n_heads, 112 | p_dropout=p_dropout, 113 | proximal_bias=proximal_bias, 114 | proximal_init=proximal_init, 115 | ) 116 | ) 117 | self.norm_layers_0.append(LayerNorm(hidden_channels)) 118 | self.encdec_attn_layers.append( 119 | MultiHeadAttention( 120 | hidden_channels, hidden_channels, n_heads, p_dropout=p_dropout 121 | ) 122 | ) 123 | self.norm_layers_1.append(LayerNorm(hidden_channels)) 124 | self.ffn_layers.append( 125 | FFN( 126 | hidden_channels, 127 | hidden_channels, 128 | filter_channels, 129 | kernel_size, 130 | p_dropout=p_dropout, 131 | causal=True, 132 | ) 133 | ) 134 | self.norm_layers_2.append(LayerNorm(hidden_channels)) 135 | 136 | def forward(self, x, x_mask, h, h_mask): 137 | """ 138 | x: decoder input 139 | h: encoder output 140 | """ 141 | self_attn_mask = commons.subsequent_mask(x_mask.size(2)).to( 142 | device=x.device, dtype=x.dtype 143 | ) 144 | encdec_attn_mask = h_mask.unsqueeze(2) * x_mask.unsqueeze(-1) 145 | x = x * x_mask 146 | for i in range(self.n_layers): 147 | y = self.self_attn_layers[i](x, x, self_attn_mask) 148 | y = self.drop(y) 149 | x = self.norm_layers_0[i](x + y) 150 | 151 | y = self.encdec_attn_layers[i](x, h, encdec_attn_mask) 152 | y = self.drop(y) 153 | x = self.norm_layers_1[i](x + y) 154 | 155 | y = self.ffn_layers[i](x, x_mask) 156 | y = self.drop(y) 157 | x = self.norm_layers_2[i](x + y) 158 | x = x * x_mask 159 | return x 160 | 161 | 162 | class MultiHeadAttention(nn.Module): 163 | def __init__( 164 | self, 165 | channels, 166 | out_channels, 167 | n_heads, 168 | p_dropout=0.0, 169 | window_size=None, 170 | heads_share=True, 171 | block_length=None, 172 | proximal_bias=False, 173 | proximal_init=False, 174 | ): 175 | super().__init__() 176 | assert channels % n_heads == 0 177 | 178 | self.channels = channels 179 | self.out_channels = out_channels 180 | self.n_heads = n_heads 181 | self.p_dropout = p_dropout 182 | self.window_size = window_size 183 | self.heads_share = heads_share 184 | self.block_length = block_length 185 | self.proximal_bias = proximal_bias 186 | self.proximal_init = proximal_init 187 | self.attn = None 188 | 189 | self.k_channels = channels // n_heads 190 | self.conv_q = nn.Conv1d(channels, channels, 1) 191 | self.conv_k = nn.Conv1d(channels, channels, 1) 192 | self.conv_v = nn.Conv1d(channels, channels, 1) 193 | self.conv_o = nn.Conv1d(channels, out_channels, 1) 194 | self.drop = nn.Dropout(p_dropout) 195 | 196 | if window_size is not None: 197 | n_heads_rel = 1 if heads_share else n_heads 198 | rel_stddev = self.k_channels**-0.5 199 | self.emb_rel_k = nn.Parameter( 200 | torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels) 201 | * rel_stddev 202 | ) 203 | self.emb_rel_v = nn.Parameter( 204 | torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels) 205 | * rel_stddev 206 | ) 207 | 208 | nn.init.xavier_uniform_(self.conv_q.weight) 209 | 
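# (with proximal_init=True, the block just below re-copies conv_q's freshly
#  initialized weight and bias into conv_k, so queries and keys start identical)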
nn.init.xavier_uniform_(self.conv_k.weight) 210 | nn.init.xavier_uniform_(self.conv_v.weight) 211 | if proximal_init: 212 | with torch.no_grad(): 213 | self.conv_k.weight.copy_(self.conv_q.weight) 214 | self.conv_k.bias.copy_(self.conv_q.bias) 215 | 216 | def forward(self, x, c, attn_mask=None): 217 | q = self.conv_q(x) 218 | k = self.conv_k(c) 219 | v = self.conv_v(c) 220 | 221 | x, self.attn = self.attention(q, k, v, mask=attn_mask) 222 | 223 | x = self.conv_o(x) 224 | return x 225 | 226 | def attention(self, query, key, value, mask=None): 227 | # reshape [b, d, t] -> [b, n_h, t, d_k] 228 | b, d, t_s, t_t = (*key.size(), query.size(2)) 229 | query = query.view(b, self.n_heads, self.k_channels, t_t).transpose(2, 3) 230 | key = key.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3) 231 | value = value.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3) 232 | 233 | scores = torch.matmul(query / math.sqrt(self.k_channels), key.transpose(-2, -1)) 234 | if self.window_size is not None: 235 | assert ( 236 | t_s == t_t 237 | ), "Relative attention is only available for self-attention." 238 | key_relative_embeddings = self._get_relative_embeddings(self.emb_rel_k, t_s) 239 | rel_logits = self._matmul_with_relative_keys( 240 | query / math.sqrt(self.k_channels), key_relative_embeddings 241 | ) 242 | scores_local = self._relative_position_to_absolute_position(rel_logits) 243 | scores = scores + scores_local 244 | if self.proximal_bias: 245 | assert t_s == t_t, "Proximal bias is only available for self-attention." 246 | scores = scores + self._attention_bias_proximal(t_s).to( 247 | device=scores.device, dtype=scores.dtype 248 | ) 249 | if mask is not None: 250 | scores = scores.masked_fill(mask == 0, -1e4) 251 | if self.block_length is not None: 252 | assert ( 253 | t_s == t_t 254 | ), "Local attention is only available for self-attention." 255 | block_mask = ( 256 | torch.ones_like(scores) 257 | .triu(-self.block_length) 258 | .tril(self.block_length) 259 | ) 260 | scores = scores.masked_fill(block_mask == 0, -1e4) 261 | p_attn = F.softmax(scores, dim=-1) # [b, n_h, t_t, t_s] 262 | p_attn = self.drop(p_attn) 263 | output = torch.matmul(p_attn, value) 264 | if self.window_size is not None: 265 | relative_weights = self._absolute_position_to_relative_position(p_attn) 266 | value_relative_embeddings = self._get_relative_embeddings( 267 | self.emb_rel_v, t_s 268 | ) 269 | output = output + self._matmul_with_relative_values( 270 | relative_weights, value_relative_embeddings 271 | ) 272 | output = ( 273 | output.transpose(2, 3).contiguous().view(b, d, t_t) 274 | ) # [b, n_h, t_t, d_k] -> [b, d, t_t] 275 | return output, p_attn 276 | 277 | def _matmul_with_relative_values(self, x, y): 278 | """ 279 | x: [b, h, l, m] 280 | y: [h or 1, m, d] 281 | ret: [b, h, l, d] 282 | """ 283 | ret = torch.matmul(x, y.unsqueeze(0)) 284 | return ret 285 | 286 | def _matmul_with_relative_keys(self, x, y): 287 | """ 288 | x: [b, h, l, d] 289 | y: [h or 1, m, d] 290 | ret: [b, h, l, m] 291 | """ 292 | ret = torch.matmul(x, y.unsqueeze(0).transpose(-2, -1)) 293 | return ret 294 | 295 | def _get_relative_embeddings(self, relative_embeddings, length): 296 | max_relative_position = 2 * self.window_size + 1 297 | # Pad first before slice to avoid using cond ops. 
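# Illustrative example: with window_size=4 there are 2*4+1 = 9 stored relative
# embeddings; a sequence of length=10 needs 2*10-1 = 19 relative positions, so
# pad_length = 10-(4+1) = 5 zero embeddings are added on each side
# (9 + 2*5 = 19) and the slice below keeps the whole padded range.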
298 | pad_length = max(length - (self.window_size + 1), 0) 299 | slice_start_position = max((self.window_size + 1) - length, 0) 300 | slice_end_position = slice_start_position + 2 * length - 1 301 | if pad_length > 0: 302 | padded_relative_embeddings = F.pad( 303 | relative_embeddings, 304 | commons.convert_pad_shape([[0, 0], [pad_length, pad_length], [0, 0]]), 305 | ) 306 | else: 307 | padded_relative_embeddings = relative_embeddings 308 | used_relative_embeddings = padded_relative_embeddings[ 309 | :, slice_start_position:slice_end_position 310 | ] 311 | return used_relative_embeddings 312 | 313 | def _relative_position_to_absolute_position(self, x): 314 | """ 315 | x: [b, h, l, 2*l-1] 316 | ret: [b, h, l, l] 317 | """ 318 | batch, heads, length, _ = x.size() 319 | # Concat columns of pad to shift from relative to absolute indexing. 320 | x = F.pad(x, commons.convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, 1]])) 321 | 322 | # Concat extra elements so to add up to shape (len+1, 2*len-1). 323 | x_flat = x.view([batch, heads, length * 2 * length]) 324 | x_flat = F.pad( 325 | x_flat, commons.convert_pad_shape([[0, 0], [0, 0], [0, length - 1]]) 326 | ) 327 | 328 | # Reshape and slice out the padded elements. 329 | x_final = x_flat.view([batch, heads, length + 1, 2 * length - 1])[ 330 | :, :, :length, length - 1 : 331 | ] 332 | return x_final 333 | 334 | def _absolute_position_to_relative_position(self, x): 335 | """ 336 | x: [b, h, l, l] 337 | ret: [b, h, l, 2*l-1] 338 | """ 339 | batch, heads, length, _ = x.size() 340 | # padd along column 341 | x = F.pad( 342 | x, commons.convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, length - 1]]) 343 | ) 344 | x_flat = x.view([batch, heads, length**2 + length * (length - 1)]) 345 | # add 0's in the beginning that will skew the elements after reshape 346 | x_flat = F.pad(x_flat, commons.convert_pad_shape([[0, 0], [0, 0], [length, 0]])) 347 | x_final = x_flat.view([batch, heads, length, 2 * length])[:, :, :, 1:] 348 | return x_final 349 | 350 | def _attention_bias_proximal(self, length): 351 | """Bias for self-attention to encourage attention to close positions. 352 | Args: 353 | length: an integer scalar. 
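(The bias between positions i and j is -log(1 + |i - j|): about zero for
neighbouring frames and increasingly negative with distance.)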
354 | Returns: 355 | a Tensor with shape [1, 1, length, length] 356 | """ 357 | r = torch.arange(length, dtype=torch.float32) 358 | diff = torch.unsqueeze(r, 0) - torch.unsqueeze(r, 1) 359 | return torch.unsqueeze(torch.unsqueeze(-torch.log1p(torch.abs(diff)), 0), 0) 360 | 361 | 362 | class FFN(nn.Module): 363 | def __init__( 364 | self, 365 | in_channels, 366 | out_channels, 367 | filter_channels, 368 | kernel_size, 369 | p_dropout=0.0, 370 | activation=None, 371 | causal=False, 372 | ): 373 | super().__init__() 374 | self.in_channels = in_channels 375 | self.out_channels = out_channels 376 | self.filter_channels = filter_channels 377 | self.kernel_size = kernel_size 378 | self.p_dropout = p_dropout 379 | self.activation = activation 380 | self.causal = causal 381 | 382 | if causal: 383 | self.padding = self._causal_padding 384 | else: 385 | self.padding = self._same_padding 386 | 387 | self.conv_1 = nn.Conv1d(in_channels, filter_channels, kernel_size) 388 | self.conv_2 = nn.Conv1d(filter_channels, out_channels, kernel_size) 389 | self.drop = nn.Dropout(p_dropout) 390 | 391 | def forward(self, x, x_mask): 392 | x = self.conv_1(self.padding(x * x_mask)) 393 | if self.activation == "gelu": 394 | x = x * torch.sigmoid(1.702 * x) 395 | else: 396 | x = torch.relu(x) 397 | x = self.drop(x) 398 | x = self.conv_2(self.padding(x * x_mask)) 399 | return x * x_mask 400 | 401 | def _causal_padding(self, x): 402 | if self.kernel_size == 1: 403 | return x 404 | pad_l = self.kernel_size - 1 405 | pad_r = 0 406 | padding = [[0, 0], [0, 0], [pad_l, pad_r]] 407 | x = F.pad(x, commons.convert_pad_shape(padding)) 408 | return x 409 | 410 | def _same_padding(self, x): 411 | if self.kernel_size == 1: 412 | return x 413 | pad_l = (self.kernel_size - 1) // 2 414 | pad_r = self.kernel_size // 2 415 | padding = [[0, 0], [0, 0], [pad_l, pad_r]] 416 | x = F.pad(x, commons.convert_pad_shape(padding)) 417 | return x 418 | -------------------------------------------------------------------------------- /commons.py: -------------------------------------------------------------------------------- 1 | import math 2 | import numpy as np 3 | import torch 4 | from torch import nn 5 | from torch.nn import functional as F 6 | 7 | 8 | def init_weights(m, mean=0.0, std=0.01): 9 | classname = m.__class__.__name__ 10 | if classname.find("Conv") != -1: 11 | m.weight.data.normal_(mean, std) 12 | 13 | 14 | def get_padding(kernel_size, dilation=1): 15 | return int((kernel_size * dilation - dilation) / 2) 16 | 17 | 18 | def convert_pad_shape(pad_shape): 19 | l = pad_shape[::-1] 20 | pad_shape = [item for sublist in l for item in sublist] 21 | return pad_shape 22 | 23 | 24 | def kl_divergence(m_p, logs_p, m_q, logs_q): 25 | """KL(P||Q)""" 26 | kl = (logs_q - logs_p) - 0.5 27 | kl += ( 28 | 0.5 * (torch.exp(2.0 * logs_p) + ((m_p - m_q) ** 2)) * torch.exp(-2.0 * logs_q) 29 | ) 30 | return kl 31 | 32 | 33 | def rand_gumbel(shape): 34 | """Sample from the Gumbel distribution, protect from overflows.""" 35 | uniform_samples = torch.rand(shape) * 0.99998 + 0.00001 36 | return -torch.log(-torch.log(uniform_samples)) 37 | 38 | 39 | def rand_gumbel_like(x): 40 | g = rand_gumbel(x.size()).to(dtype=x.dtype, device=x.device) 41 | return g 42 | 43 | 44 | def slice_segments(x, ids_str, segment_size=4): 45 | ret = torch.zeros_like(x[:, :, :segment_size]) 46 | for i in range(x.size(0)): 47 | idx_str = ids_str[i] 48 | idx_end = idx_str + segment_size 49 | ret[i] = x[i, :, idx_str:idx_end] 50 | return ret 51 | 52 | 53 | def rand_slice_segments(x, 
x_lengths=None, segment_size=4): 54 | b, d, t = x.size() 55 | if x_lengths is None: 56 | x_lengths = t 57 | ids_str_max = x_lengths - segment_size + 1 58 | ids_str = (torch.rand([b]).to(device=x.device) * ids_str_max).to(dtype=torch.long) 59 | ret = slice_segments(x, ids_str, segment_size) 60 | return ret, ids_str 61 | 62 | 63 | def get_timing_signal_1d(length, channels, min_timescale=1.0, max_timescale=1.0e4): 64 | position = torch.arange(length, dtype=torch.float) 65 | num_timescales = channels // 2 66 | log_timescale_increment = math.log(float(max_timescale) / float(min_timescale)) / ( 67 | num_timescales - 1 68 | ) 69 | inv_timescales = min_timescale * torch.exp( 70 | torch.arange(num_timescales, dtype=torch.float) * -log_timescale_increment 71 | ) 72 | scaled_time = position.unsqueeze(0) * inv_timescales.unsqueeze(1) 73 | signal = torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], 0) 74 | signal = F.pad(signal, [0, 0, 0, channels % 2]) 75 | signal = signal.view(1, channels, length) 76 | return signal 77 | 78 | 79 | def add_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4): 80 | b, channels, length = x.size() 81 | signal = get_timing_signal_1d(length, channels, min_timescale, max_timescale) 82 | return x + signal.to(dtype=x.dtype, device=x.device) 83 | 84 | 85 | def cat_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4, axis=1): 86 | b, channels, length = x.size() 87 | signal = get_timing_signal_1d(length, channels, min_timescale, max_timescale) 88 | return torch.cat([x, signal.to(dtype=x.dtype, device=x.device)], axis) 89 | 90 | 91 | def subsequent_mask(length): 92 | mask = torch.tril(torch.ones(length, length)).unsqueeze(0).unsqueeze(0) 93 | return mask 94 | 95 | 96 | @torch.jit.script 97 | def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels): 98 | n_channels_int = n_channels[0] 99 | in_act = input_a + input_b 100 | t_act = torch.tanh(in_act[:, :n_channels_int, :]) 101 | s_act = torch.sigmoid(in_act[:, n_channels_int:, :]) 102 | acts = t_act * s_act 103 | return acts 104 | 105 | 106 | def convert_pad_shape(pad_shape): 107 | l = pad_shape[::-1] 108 | pad_shape = [item for sublist in l for item in sublist] 109 | return pad_shape 110 | 111 | 112 | def shift_1d(x): 113 | x = F.pad(x, convert_pad_shape([[0, 0], [0, 0], [1, 0]]))[:, :, :-1] 114 | return x 115 | 116 | 117 | def sequence_mask(length, max_length=None): 118 | if max_length is None: 119 | max_length = length.max() 120 | x = torch.arange(max_length, dtype=length.dtype, device=length.device) 121 | return x.unsqueeze(0) < length.unsqueeze(1) 122 | 123 | 124 | def generate_path(duration, mask): 125 | """ 126 | duration: [b, 1, t_x] 127 | mask: [b, 1, t_y, t_x] 128 | """ 129 | device = duration.device 130 | 131 | b, _, t_y, t_x = mask.shape 132 | cum_duration = torch.cumsum(duration, -1) 133 | 134 | cum_duration_flat = cum_duration.view(b * t_x) 135 | path = sequence_mask(cum_duration_flat, t_y).to(mask.dtype) 136 | path = path.view(b, t_x, t_y) 137 | path = path - F.pad(path, convert_pad_shape([[0, 0], [1, 0], [0, 0]]))[:, :-1] 138 | path = path.unsqueeze(1).transpose(2, 3) * mask 139 | return path 140 | 141 | 142 | def clip_grad_value_(parameters, clip_value, norm_type=2): 143 | if isinstance(parameters, torch.Tensor): 144 | parameters = [parameters] 145 | parameters = list(filter(lambda p: p.grad is not None, parameters)) 146 | norm_type = float(norm_type) 147 | if clip_value is not None: 148 | clip_value = float(clip_value) 149 | 150 | total_norm = 0 151 | for p in parameters: 152 | 
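# accumulate each parameter's gradient norm; total_norm below combines them
# into the overall norm_type-norm across all parameters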
param_norm = p.grad.data.norm(norm_type) 153 | total_norm += param_norm.item() ** norm_type 154 | if clip_value is not None: 155 | p.grad.data.clamp_(min=-clip_value, max=clip_value) 156 | total_norm = total_norm ** (1.0 / norm_type) 157 | return total_norm 158 | -------------------------------------------------------------------------------- /configs/requirements.txt: -------------------------------------------------------------------------------- 1 | Cython==0.29.21 2 | librosa==0.8.0 3 | matplotlib==3.3.1 4 | numpy==1.18.5 5 | phonemizer==2.2.1 6 | scipy==1.5.2 7 | tensorboard==2.3.0 8 | torch==1.6.0 9 | torchvision==0.7.0 10 | Unidecode==1.1.1 11 | -------------------------------------------------------------------------------- /configs/singing_base.json: -------------------------------------------------------------------------------- 1 | { 2 | "train": { 3 | "log_interval": 200, 4 | "eval_interval": 2000, 5 | "seed": 1234, 6 | "epochs": 20000, 7 | "learning_rate": 1e-4, 8 | "betas": [ 9 | 0.8, 10 | 0.99 11 | ], 12 | "eps": 1e-9, 13 | "batch_size": 6, 14 | "fp16_run": false, 15 | "lr_decay": 0.999875, 16 | "segment_size": 8192, 17 | "init_lr_ratio": 1, 18 | "warmup_epochs": 0, 19 | "c_mel": 45, 20 | "c_kl": 1.0, 21 | "keep_n_models": 20 22 | }, 23 | "data": { 24 | "training_files": "filelists/singing_train.txt", 25 | "validation_files": "filelists/singing_valid.txt", 26 | "max_wav_value": 32768.0, 27 | "sampling_rate": 24000, 28 | "filter_length": 1024, 29 | "hop_length": 256, 30 | "win_length": 1024, 31 | "n_mel_channels": 80, 32 | "mel_fmin": 0.0, 33 | "mel_fmax": null, 34 | "n_speakers": 0 35 | }, 36 | "model": { 37 | "inter_channels": 192, 38 | "hidden_channels": 192, 39 | "filter_channels": 768, 40 | "n_heads": 2, 41 | "n_layers": 6, 42 | "kernel_size": 3, 43 | "p_dropout": 0.1, 44 | "resblock": "1", 45 | "resblock_kernel_sizes": [ 46 | 3, 47 | 7, 48 | 11 49 | ], 50 | "resblock_dilation_sizes": [ 51 | [ 52 | 1, 53 | 3, 54 | 5 55 | ], 56 | [ 57 | 1, 58 | 3, 59 | 5 60 | ], 61 | [ 62 | 1, 63 | 3, 64 | 5 65 | ] 66 | ], 67 | "upsample_rates": [ 68 | 8, 69 | 8, 70 | 2, 71 | 2 72 | ], 73 | "upsample_initial_channel": 384, 74 | "upsample_kernel_sizes": [ 75 | 16, 76 | 16, 77 | 4, 78 | 4 79 | ], 80 | "use_spectral_norm": false 81 | } 82 | } -------------------------------------------------------------------------------- /data_utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | import torch 4 | import torch.utils.data 5 | 6 | from mel_processing import spectrogram_torch 7 | from utils import load_wav_to_torch, load_filepaths_and_text 8 | import scipy.io.wavfile as sciwav 9 | 10 | 11 | class TextAudioLoader(torch.utils.data.Dataset): 12 | """ 13 | 1) loads audio, text pairs 14 | 2) normalizes text and converts them to sequences of integers 15 | 3) computes spectrograms from audio files. 
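Each filelist line carries seven fields:
    wav path|label path|label duration path|score path|score duration path|pitch path|slurs path
A minimal usage sketch (assuming hparams is the "data" section of
configs/singing_base.json, e.g. as loaded by utils.get_hparams in train.py):

    dataset = TextAudioLoader("filelists/singing_train.txt", hparams)
    phone, phone_dur, score, score_dur, pitch, slurs, spec, wav = dataset[0]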
16 | """ 17 | 18 | def __init__(self, audiopaths_and_text, hparams): 19 | self.audiopaths_and_text = load_filepaths_and_text(audiopaths_and_text) 20 | self.max_wav_value = hparams.max_wav_value 21 | self.sampling_rate = hparams.sampling_rate 22 | self.filter_length = hparams.filter_length 23 | self.hop_length = hparams.hop_length 24 | self.win_length = hparams.win_length 25 | self.sampling_rate = hparams.sampling_rate 26 | self.min_text_len = getattr(hparams, "min_text_len", 1) 27 | self.max_text_len = getattr(hparams, "max_text_len", 5000) 28 | self._filter() 29 | 30 | def _filter(self): 31 | """ 32 | Filter text & store spec lengths 33 | """ 34 | # Store spectrogram lengths for Bucketing 35 | # wav_length ~= file_size / (wav_channels * Bytes per dim) = file_size / (1 * 2) 36 | # spec_length = wav_length // hop_length 37 | audiopaths_and_text_new = [] 38 | lengths = [] 39 | 40 | for ( 41 | audiopath, 42 | text, 43 | text_dur, 44 | score, 45 | score_dur, 46 | pitch, 47 | slur, 48 | ) in self.audiopaths_and_text: 49 | if self.min_text_len <= len(text) and len(text) <= self.max_text_len: 50 | audiopaths_and_text_new.append( 51 | [audiopath, text, text_dur, score, score_dur, pitch, slur] 52 | ) 53 | lengths.append(os.path.getsize(audiopath) // (2 * self.hop_length)) 54 | self.audiopaths_and_text = audiopaths_and_text_new 55 | self.lengths = lengths 56 | 57 | def get_audio_text_pair(self, audiopath_and_text): 58 | # separate filename and text 59 | file = audiopath_and_text[0] 60 | phone = audiopath_and_text[1] 61 | phone_dur = audiopath_and_text[2] 62 | score = audiopath_and_text[3] 63 | score_dur = audiopath_and_text[4] 64 | pitch = audiopath_and_text[5] 65 | slurs = audiopath_and_text[6] 66 | 67 | phone, phone_dur, score, score_dur, pitch, slurs = self.get_labels( 68 | phone, phone_dur, score, score_dur, pitch, slurs 69 | ) 70 | spec, wav = self.get_audio(file, phone_dur) 71 | 72 | len_phone = phone.size()[0] 73 | len_spec = spec.size()[-1] 74 | 75 | if len_phone != len_spec: 76 | # print("**************CareFull*******************") 77 | # print(f"filepath={audiopath_and_text[0]}") 78 | # print(f"len_text={len_phone}") 79 | # print(f"len_spec={len_spec}") 80 | if len_phone > len_spec: 81 | print(file) 82 | print("len_phone", len_phone) 83 | print("len_spec", len_spec) 84 | assert len_phone < len_spec 85 | # len_min = min(len_phone, len_spec) 86 | # amor hop_size=256 87 | len_wav = len_spec * self.hop_length 88 | # print(wav.size()) 89 | # print(f"len_min={len_min}") 90 | # print(f"len_wav={len_wav}") 91 | # spec = spec[:, :len_min] 92 | wav = wav[:, :len_wav] 93 | return (phone, phone_dur, score, score_dur, pitch, slurs, spec, wav) 94 | 95 | def get_labels(self, phone, phone_dur, score, score_dur, pitch, slurs): 96 | phone = np.load(phone) 97 | phone_dur = np.load(phone_dur) 98 | score = np.load(score) 99 | score_dur = np.load(score_dur) 100 | pitch = np.load(pitch) 101 | slurs = np.load(slurs) 102 | phone = torch.LongTensor(phone) 103 | phone_dur = torch.LongTensor(phone_dur) 104 | score = torch.LongTensor(score) 105 | score_dur = torch.LongTensor(score_dur) 106 | pitch = torch.FloatTensor(pitch) 107 | slurs = torch.LongTensor(slurs) 108 | return phone, phone_dur, score, score_dur, pitch, slurs 109 | 110 | def get_audio(self, filename, phone_dur): 111 | audio, sampling_rate = load_wav_to_torch(filename) 112 | if sampling_rate != self.sampling_rate: 113 | raise ValueError( 114 | "{} {} SR doesn't match target {} SR".format( 115 | filename, sampling_rate, self.sampling_rate 116 | ) 117 | ) 
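# int16 samples are scaled by max_wav_value (32768.0 in configs/singing_base.json)
# into the range [-1.0, 1.0]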
118 |         audio_norm = audio / self.max_wav_value
119 |         audio_norm = audio_norm.unsqueeze(0)
120 |         spec_filename = filename.replace(".wav", ".spec.pt")
121 |         if os.path.exists(spec_filename):
122 |             spec = torch.load(spec_filename)
123 |         else:
124 |             print("please run data_vits_phn.py first")
125 |             raise FileNotFoundError(spec_filename)
126 |         # else:
127 |         #     spec = spectrogram_torch(
128 |         #         audio_norm,
129 |         #         self.filter_length,
130 |         #         self.sampling_rate,
131 |         #         self.hop_length,
132 |         #         self.win_length,
133 |         #         center=False,
134 |         #     )
135 |         #     # align mel and wave
136 |         #     phone_dur_sum = torch.sum(phone_dur).item()
137 |         #     spec_length = spec.shape[2]
138 |
139 |         #     if spec_length > phone_dur_sum:
140 |         #         spec = spec[:, :, :phone_dur_sum]
141 |         #     elif spec_length < phone_dur_sum:
142 |         #         pad_length = phone_dur_sum - spec_length
143 |         #         spec = torch.nn.functional.pad(
144 |         #             input=spec, pad=(0, pad_length, 0, 0), mode="constant", value=0
145 |         #         )
146 |         #     assert spec.shape[2] == phone_dur_sum
147 |
148 |         #     # align wav
149 |         #     fixed_wav_len = phone_dur_sum * self.hop_length
150 |         #     if audio_norm.shape[1] > fixed_wav_len:
151 |         #         audio_norm = audio_norm[:, :fixed_wav_len]
152 |         #     elif audio_norm.shape[1] < fixed_wav_len:
153 |         #         pad_length = fixed_wav_len - audio_norm.shape[1]
154 |         #         audio_norm = torch.nn.functional.pad(
155 |         #             input=audio_norm,
156 |         #             pad=(0, pad_length, 0, 0),
157 |         #             mode="constant",
158 |         #             value=0,
159 |         #         )
160 |         #     assert audio_norm.shape[1] == fixed_wav_len
161 |
162 |         #     # rewrite aligned wav
163 |         #     audio = (audio_norm * self.max_wav_value).transpose(0, 1).numpy().astype(np.int16)
164 |
165 |         #     sciwav.write(
166 |         #         filename,
167 |         #         self.sampling_rate,
168 |         #         audio,
169 |         #     )
170 |         #     # save spec
171 |         #     spec = torch.squeeze(spec, 0)
172 |         #     torch.save(spec, spec_filename)
173 |         return spec, audio_norm
174 |
175 |     def __getitem__(self, index):
176 |         return self.get_audio_text_pair(self.audiopaths_and_text[index])
177 |
178 |     def __len__(self):
179 |         return len(self.audiopaths_and_text)
180 |
181 |
182 | class TextAudioCollate:
183 |     """Zero-pads model inputs and targets"""
184 |
185 |     def __init__(self, return_ids=False):
186 |         self.return_ids = return_ids
187 |
188 |     def __call__(self, batch):
189 |         """Collates a training batch from normalized text and audio.
190 |         PARAMS
191 |         ------
192 |         batch: [phone, phone_dur, score, score_dur, pitch, slurs, spec, wav]
193 |         """
194 |         # Right zero-pad all one-hot text sequences to max input length
195 |         _, ids_sorted_decreasing = torch.sort(
196 |             torch.LongTensor([x[6].size(1) for x in batch]), dim=0, descending=True
197 |         )
198 |
199 |         max_phone_len = max([len(x[0]) for x in batch])
200 |         max_spec_len = max([x[6].size(1) for x in batch])
201 |         max_wave_len = max([x[7].size(1) for x in batch])
202 |
203 |         phone_lengths = torch.LongTensor(len(batch))
204 |         phone_padded = torch.LongTensor(len(batch), max_phone_len)
205 |         phone_dur_padded = torch.LongTensor(len(batch), max_phone_len)
206 |         score_padded = torch.LongTensor(len(batch), max_phone_len)
207 |         score_dur_padded = torch.LongTensor(len(batch), max_phone_len)
208 |         pitch_padded = torch.FloatTensor(len(batch), max_spec_len)
209 |         slurs_padded = torch.LongTensor(len(batch), max_phone_len)
210 |         phone_padded.zero_()
211 |         phone_dur_padded.zero_()
212 |         score_padded.zero_()
213 |         score_dur_padded.zero_()
214 |         pitch_padded.zero_()
215 |         slurs_padded.zero_()
216 |
217 |         spec_lengths = torch.LongTensor(len(batch))
218 |         wave_lengths = torch.LongTensor(len(batch))
219 |         spec_padded = 
torch.FloatTensor(len(batch), batch[0][6].size(0), max_spec_len) 220 | wave_padded = torch.FloatTensor(len(batch), 1, max_wave_len) 221 | spec_padded.zero_() 222 | wave_padded.zero_() 223 | 224 | for i in range(len(ids_sorted_decreasing)): 225 | row = batch[ids_sorted_decreasing[i]] 226 | 227 | phone = row[0] 228 | phone_padded[i, : phone.size(0)] = phone 229 | phone_lengths[i] = phone.size(0) 230 | 231 | phone_dur = row[1] 232 | phone_dur_padded[i, : phone_dur.size(0)] = phone_dur 233 | 234 | score = row[2] 235 | score_padded[i, : score.size(0)] = score 236 | 237 | score_dur = row[3] 238 | score_dur_padded[i, : score_dur.size(0)] = score_dur 239 | 240 | pitch = row[4] 241 | pitch_padded[i, : pitch.size(0)] = pitch 242 | 243 | slurs = row[5] 244 | slurs_padded[i, : slurs.size(0)] = slurs 245 | 246 | spec = row[6] 247 | spec_padded[i, :, : spec.size(1)] = spec 248 | spec_lengths[i] = spec.size(1) 249 | 250 | wave = row[7] 251 | wave_padded[i, :, : wave.size(1)] = wave 252 | wave_lengths[i] = wave.size(1) 253 | 254 | return ( 255 | phone_padded, 256 | phone_lengths, 257 | phone_dur_padded, 258 | score_padded, 259 | score_dur_padded, 260 | pitch_padded, 261 | slurs_padded, 262 | spec_padded, 263 | spec_lengths, 264 | wave_padded, 265 | wave_lengths, 266 | ) 267 | 268 | 269 | class DistributedBucketSampler(torch.utils.data.distributed.DistributedSampler): 270 | """ 271 | Maintain similar input lengths in a batch. 272 | Length groups are specified by boundaries. 273 | Ex) boundaries = [b1, b2, b3] -> any batch is included either {x | b1 < length(x) <=b2} or {x | b2 < length(x) <= b3}. 274 | 275 | It removes samples which are not included in the boundaries. 276 | Ex) boundaries = [b1, b2, b3] -> any x s.t. length(x) <= b1 or length(x) > b3 are discarded. 
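A minimal sketch (the boundary values here are illustrative, not taken
from train.py):

    sampler = DistributedBucketSampler(
        dataset, batch_size=6, boundaries=[32, 300, 400, 500],
        num_replicas=1, rank=0, shuffle=True)
    loader = torch.utils.data.DataLoader(
        dataset, batch_sampler=sampler, collate_fn=TextAudioCollate())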
277 | """ 278 | 279 | def __init__( 280 | self, 281 | dataset, 282 | batch_size, 283 | boundaries, 284 | num_replicas=None, 285 | rank=None, 286 | shuffle=True, 287 | ): 288 | super().__init__(dataset, num_replicas=num_replicas, rank=rank, shuffle=shuffle) 289 | self.lengths = dataset.lengths 290 | self.batch_size = batch_size 291 | self.boundaries = boundaries 292 | 293 | self.buckets, self.num_samples_per_bucket = self._create_buckets() 294 | self.total_size = sum(self.num_samples_per_bucket) 295 | self.num_samples = self.total_size // self.num_replicas 296 | 297 | def _create_buckets(self): 298 | buckets = [[] for _ in range(len(self.boundaries) - 1)] 299 | for i in range(len(self.lengths)): 300 | length = self.lengths[i] 301 | idx_bucket = self._bisect(length) 302 | if idx_bucket != -1: 303 | buckets[idx_bucket].append(i) 304 | 305 | for i in range(len(buckets) - 1, 0, -1): 306 | if len(buckets[i]) == 0: 307 | buckets.pop(i) 308 | self.boundaries.pop(i + 1) 309 | 310 | num_samples_per_bucket = [] 311 | for i in range(len(buckets)): 312 | len_bucket = len(buckets[i]) 313 | total_batch_size = self.num_replicas * self.batch_size 314 | rem = ( 315 | total_batch_size - (len_bucket % total_batch_size) 316 | ) % total_batch_size 317 | num_samples_per_bucket.append(len_bucket + rem) 318 | return buckets, num_samples_per_bucket 319 | 320 | def __iter__(self): 321 | # deterministically shuffle based on epoch 322 | g = torch.Generator() 323 | g.manual_seed(self.epoch) 324 | 325 | indices = [] 326 | if self.shuffle: 327 | for bucket in self.buckets: 328 | indices.append(torch.randperm(len(bucket), generator=g).tolist()) 329 | else: 330 | for bucket in self.buckets: 331 | indices.append(list(range(len(bucket)))) 332 | 333 | batches = [] 334 | for i in range(len(self.buckets)): 335 | bucket = self.buckets[i] 336 | len_bucket = len(bucket) 337 | ids_bucket = indices[i] 338 | num_samples_bucket = self.num_samples_per_bucket[i] 339 | 340 | # add extra samples to make it evenly divisible 341 | rem = num_samples_bucket - len_bucket 342 | ids_bucket = ( 343 | ids_bucket 344 | + ids_bucket * (rem // len_bucket) 345 | + ids_bucket[: (rem % len_bucket)] 346 | ) 347 | 348 | # subsample 349 | ids_bucket = ids_bucket[self.rank :: self.num_replicas] 350 | 351 | # batching 352 | for j in range(len(ids_bucket) // self.batch_size): 353 | batch = [ 354 | bucket[idx] 355 | for idx in ids_bucket[ 356 | j * self.batch_size : (j + 1) * self.batch_size 357 | ] 358 | ] 359 | batches.append(batch) 360 | 361 | if self.shuffle: 362 | batch_ids = torch.randperm(len(batches), generator=g).tolist() 363 | batches = [batches[i] for i in batch_ids] 364 | self.batches = batches 365 | 366 | assert len(self.batches) * self.batch_size == self.num_samples 367 | return iter(self.batches) 368 | 369 | def _bisect(self, x, lo=0, hi=None): 370 | if hi is None: 371 | hi = len(self.boundaries) - 1 372 | 373 | if hi > lo: 374 | mid = (hi + lo) // 2 375 | if self.boundaries[mid] < x and x <= self.boundaries[mid + 1]: 376 | return mid 377 | elif x <= self.boundaries[mid]: 378 | return self._bisect(x, lo, mid) 379 | else: 380 | return self._bisect(x, mid + 1, hi) 381 | else: 382 | return -1 383 | 384 | def __len__(self): 385 | return self.num_samples // self.batch_size 386 | -------------------------------------------------------------------------------- /evaluate/evaluate_f0.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2021 Wen-Chin Huang and Tomoki Hayashi 4 
| # Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)
5 |
6 | """Evaluate log-F0 RMSE between generated and groundtruth audios based on World."""
7 |
8 | import argparse
9 | import fnmatch
10 | import logging
11 | import multiprocessing as mp
12 | import os
13 | from typing import Dict, List, Tuple
14 |
15 | import librosa
16 | import numpy as np
17 | import pysptk
18 | import pyworld as pw
19 | import soundfile as sf
20 | from fastdtw import fastdtw
21 | from scipy import spatial
22 |
23 |
24 | def find_files(
25 |     root_dir: str, query: List[str] = ["*.flac", "*.wav"], include_root_dir: bool = True
26 | ) -> List[str]:
27 |     """Find files recursively.
28 |
29 |     Args:
30 |         root_dir (str): Root directory to search.
31 |         query (List[str]): Filename patterns to match.
32 |         include_root_dir (bool): If False, root_dir name is not included.
33 |
34 |     Returns:
35 |         List[str]: List of found filenames.
36 |
37 |     """
38 |     files = []
39 |     for root, dirnames, filenames in os.walk(root_dir, followlinks=True):
40 |         for q in query:
41 |             for filename in fnmatch.filter(filenames, q):
42 |                 files.append(os.path.join(root, filename))
43 |     if not include_root_dir:
44 |         files = [file_.replace(root_dir + "/", "") for file_ in files]
45 |
46 |     return files
47 |
48 |
49 | def world_extract(
50 |     x: np.ndarray,
51 |     fs: int,
52 |     f0min: int = 40,
53 |     f0max: int = 800,
54 |     n_fft: int = 512,
55 |     n_shift: int = 256,
56 |     mcep_dim: int = 25,
57 |     mcep_alpha: float = 0.41,
58 | ) -> Tuple[np.ndarray, np.ndarray]:
59 |     """Extract World-based acoustic features.
60 |
61 |     Args:
62 |         x (ndarray): 1D waveform array.
63 |         fs (int): Sampling rate.
64 |         f0min (int): Minimum f0 value (default=40).
65 |         f0max (int): Maximum f0 value (default=800).
66 |         n_fft (int): FFT length in point (default=512).
67 |         n_shift (int): Shift length in point (default=256).
68 |         mcep_dim (int): Dimension of mel-cepstrum (default=25).
69 |         mcep_alpha (float): All pass filter coefficient (default=0.41).
70 |
71 |     Returns:
72 |         ndarray: Mel-cepstrum with the size (N, mcep_dim + 1).
73 |         ndarray: F0 sequence (N,).
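    Note: harvest's frame_period is set to n_shift / fs * 1000 ms, so the
    extracted F0 frames line up with a spectrogram computed at the same hop.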
74 |
75 |     """
76 |     # extract features
77 |     x = x.astype(np.float64)
78 |     f0, time_axis = pw.harvest(
79 |         x,
80 |         fs,
81 |         f0_floor=f0min,
82 |         f0_ceil=f0max,
83 |         frame_period=n_shift / fs * 1000,
84 |     )
85 |     sp = pw.cheaptrick(x, f0, time_axis, fs, fft_size=n_fft)
86 |     if mcep_dim is None or mcep_alpha is None:
87 |         mcep_dim, mcep_alpha = _get_best_mcep_params(fs)
88 |     mcep = pysptk.sp2mc(sp, mcep_dim, mcep_alpha)
89 |
90 |     return mcep, f0
91 |
92 |
93 | def _get_basename(path: str) -> str:
94 |     return os.path.splitext(os.path.split(path)[-1])[0]
95 |
96 |
97 | def _get_best_mcep_params(fs: int) -> Tuple[int, float]:
98 |     if fs == 16000:
99 |         return 23, 0.42
100 |     elif fs == 22050:
101 |         return 34, 0.45
102 |     elif fs == 24000:
103 |         return 34, 0.46
104 |     elif fs == 44100:
105 |         return 39, 0.53
106 |     elif fs == 48000:
107 |         return 39, 0.55
108 |     else:
109 |         raise ValueError(f"Not found the setting for {fs}.")
110 |
111 |
112 | def calculate(
113 |     file_list: List[str],
114 |     gt_file_list: List[str],
115 |     args: argparse.Namespace,
116 |     f0_rmse_dict: Dict[str, float],
117 | ):
118 |     """Calculate log-F0 RMSE."""
119 |     for i, gen_path in enumerate(file_list):
120 |         corresponding_list = list(
121 |             filter(
122 |                 lambda gt_path: _get_basename(gt_path)[:-7] in gen_path, gt_file_list
123 |             )
124 |         )
125 |         assert len(corresponding_list) == 1
126 |         gt_path = corresponding_list[0]
127 |         gt_basename = _get_basename(gt_path)
128 |
129 |         # load wav file as int16
130 |         gen_x, gen_fs = sf.read(gen_path, dtype="int16")
131 |         gt_x, gt_fs = sf.read(gt_path, dtype="int16")
132 |
133 |         fs = gen_fs
134 |         if gen_fs != gt_fs:
135 |             gt_x = librosa.resample(gt_x.astype(np.float), gt_fs, gen_fs)
136 |
137 |         # extract ground truth and converted features
138 |         gen_mcep, gen_f0 = world_extract(
139 |             x=gen_x,
140 |             fs=fs,
141 |             f0min=args.f0min,
142 |             f0max=args.f0max,
143 |             n_fft=args.n_fft,
144 |             n_shift=args.n_shift,
145 |             mcep_dim=args.mcep_dim,
146 |             mcep_alpha=args.mcep_alpha,
147 |         )
148 |         gt_mcep, gt_f0 = world_extract(
149 |             x=gt_x,
150 |             fs=fs,
151 |             f0min=args.f0min,
152 |             f0max=args.f0max,
153 |             n_fft=args.n_fft,
154 |             n_shift=args.n_shift,
155 |             mcep_dim=args.mcep_dim,
156 |             mcep_alpha=args.mcep_alpha,
157 |         )
158 |
159 |         # DTW
160 |         _, path = fastdtw(gen_mcep, gt_mcep, dist=spatial.distance.euclidean)
161 |         twf = np.array(path).T
162 |         gen_f0_dtw = gen_f0[twf[0]]
163 |         gt_f0_dtw = gt_f0[twf[1]]
164 |
165 |         # Get voiced part
166 |         nonzero_idxs = np.where((gen_f0_dtw != 0) & (gt_f0_dtw != 0))[0]
167 |         gen_f0_dtw_voiced = np.log(gen_f0_dtw[nonzero_idxs])
168 |         gt_f0_dtw_voiced = np.log(gt_f0_dtw[nonzero_idxs])
169 |
170 |         # log F0 RMSE
171 |         log_f0_rmse = np.sqrt(np.mean((gen_f0_dtw_voiced - gt_f0_dtw_voiced) ** 2))
172 |         logging.info(f"{gt_basename} {log_f0_rmse:.4f}")
173 |         f0_rmse_dict[gt_basename] = log_f0_rmse
174 |
175 |
176 | def get_parser() -> argparse.ArgumentParser:
177 |     """Get argument parser."""
178 |     parser = argparse.ArgumentParser(description="Evaluate log-F0 RMSE.")
179 |     parser.add_argument(
180 |         "gen_wavdir_or_wavscp",
181 |         type=str,
182 |         help="Path of directory or wav.scp for generated waveforms.",
183 |     )
184 |     parser.add_argument(
185 |         "gt_wavdir_or_wavscp",
186 |         type=str,
187 |         help="Path of directory or wav.scp for ground truth waveforms.",
188 |     )
189 |     parser.add_argument(
190 |         "--outdir",
191 |         type=str,
192 |         help="Path of directory to write the results.",
193 |     )
194 |
195 |     # analysis related
196 |     parser.add_argument(
197 |         "--mcep_dim",
198 |         default=None,
199
| type=int, 200 | help=( 201 | "Dimension of mel cepstrum coefficients. " 202 | "If None, automatically set to the best dimension for the sampling." 203 | ), 204 | ) 205 | parser.add_argument( 206 | "--mcep_alpha", 207 | default=None, 208 | type=float, 209 | help=( 210 | "All pass constant for mel-cepstrum analysis. " 211 | "If None, automatically set to the best dimension for the sampling." 212 | ), 213 | ) 214 | parser.add_argument( 215 | "--n_fft", 216 | default=1024, 217 | type=int, 218 | help="The number of FFT points.", 219 | ) 220 | parser.add_argument( 221 | "--n_shift", 222 | default=256, 223 | type=int, 224 | help="The number of shift points.", 225 | ) 226 | parser.add_argument( 227 | "--f0min", 228 | default=40, 229 | type=int, 230 | help="Minimum f0 value.", 231 | ) 232 | parser.add_argument( 233 | "--f0max", 234 | default=800, 235 | type=int, 236 | help="Maximum f0 value.", 237 | ) 238 | parser.add_argument( 239 | "--nj", 240 | default=16, 241 | type=int, 242 | help="Number of parallel jobs.", 243 | ) 244 | parser.add_argument( 245 | "--verbose", 246 | default=1, 247 | type=int, 248 | help="Verbosity level. Higher is more logging.", 249 | ) 250 | return parser 251 | 252 | 253 | def main(): 254 | """Run log-F0 RMSE calculation in parallel.""" 255 | args = get_parser().parse_args() 256 | 257 | # logging info 258 | if args.verbose > 1: 259 | logging.basicConfig( 260 | level=logging.DEBUG, 261 | format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s", 262 | ) 263 | elif args.verbose > 0: 264 | logging.basicConfig( 265 | level=logging.INFO, 266 | format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s", 267 | ) 268 | else: 269 | logging.basicConfig( 270 | level=logging.WARN, 271 | format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s", 272 | ) 273 | logging.warning("Skip DEBUG/INFO messages") 274 | 275 | # find files 276 | if os.path.isdir(args.gen_wavdir_or_wavscp): 277 | gen_files = sorted(find_files(args.gen_wavdir_or_wavscp)) 278 | else: 279 | with open(args.gen_wavdir_or_wavscp) as f: 280 | gen_files = [line.strip().split(None, 1)[1] for line in f.readlines()] 281 | if gen_files[0].endswith("|"): 282 | raise ValueError("Not supported wav.scp format.") 283 | if os.path.isdir(args.gt_wavdir_or_wavscp): 284 | gt_files = sorted(find_files(args.gt_wavdir_or_wavscp)) 285 | else: 286 | with open(args.gt_wavdir_or_wavscp) as f: 287 | gt_files = [line.strip().split(None, 1)[1] for line in f.readlines()] 288 | if gt_files[0].endswith("|"): 289 | raise ValueError("Not supported wav.scp format.") 290 | 291 | # Get and divide list 292 | if len(gen_files) == 0: 293 | raise FileNotFoundError("Not found any generated audio files.") 294 | if len(gen_files) > len(gt_files): 295 | raise ValueError( 296 | "#groundtruth files are less than #generated files " 297 | f"(#gen={len(gen_files)} vs. #gt={len(gt_files)}). " 298 | "Please check the groundtruth directory." 
299 | ) 300 | logging.info("The number of utterances = %d" % len(gen_files)) 301 | file_lists = np.array_split(gen_files, args.nj) 302 | file_lists = [f_list.tolist() for f_list in file_lists] 303 | 304 | # multi processing 305 | with mp.Manager() as manager: 306 | log_f0_rmse_dict = manager.dict() 307 | processes = [] 308 | # for f in file_lists: 309 | # calculate(f, gt_files, args, log_f0_rmse_dict) 310 | for f in file_lists: 311 | p = mp.Process(target=calculate, args=(f, gt_files, args, log_f0_rmse_dict)) 312 | p.start() 313 | processes.append(p) 314 | 315 | # wait for all process 316 | for p in processes: 317 | p.join() 318 | 319 | # convert to standard list 320 | log_f0_rmse_dict = dict(log_f0_rmse_dict) 321 | 322 | # calculate statistics 323 | mean_log_f0_rmse = np.mean(np.array([v for v in log_f0_rmse_dict.values()])) 324 | std_log_f0_rmse = np.std(np.array([v for v in log_f0_rmse_dict.values()])) 325 | logging.info(f"Average: {mean_log_f0_rmse:.4f} ± {std_log_f0_rmse:.4f}") 326 | 327 | # write results 328 | if args.outdir is None: 329 | if os.path.isdir(args.gen_wavdir_or_wavscp): 330 | args.outdir = args.gen_wavdir_or_wavscp 331 | else: 332 | args.outdir = os.path.dirname(args.gen_wavdir_or_wavscp) 333 | os.makedirs(args.outdir, exist_ok=True) 334 | with open(f"{args.outdir}/utt2log_f0_rmse", "w") as f: 335 | for utt_id in sorted(log_f0_rmse_dict.keys()): 336 | log_f0_rmse = log_f0_rmse_dict[utt_id] 337 | f.write(f"{utt_id} {log_f0_rmse:.4f}\n") 338 | with open(f"{args.outdir}/log_f0_rmse_avg_result.txt", "w") as f: 339 | f.write(f"#utterances: {len(gen_files)}\n") 340 | f.write(f"Average: {mean_log_f0_rmse:.4f} ± {std_log_f0_rmse:.4f}") 341 | 342 | logging.info("Successfully finished log-F0 RMSE evaluation.") 343 | 344 | 345 | if __name__ == "__main__": 346 | main() 347 | -------------------------------------------------------------------------------- /evaluate/evaluate_mcd.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2020 Wen-Chin Huang and Tomoki Hayashi 4 | # Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0) 5 | 6 | """Evaluate MCD between generated and groundtruth audios with SPTK-based mcep.""" 7 | 8 | import argparse 9 | import fnmatch 10 | import logging 11 | import multiprocessing as mp 12 | import os 13 | from typing import Dict, List, Tuple 14 | 15 | import librosa 16 | import numpy as np 17 | import pysptk 18 | import soundfile as sf 19 | from fastdtw import fastdtw 20 | from scipy import spatial 21 | 22 | 23 | def find_files( 24 | root_dir: str, query: List[str] = ["*.flac", "*.wav"], include_root_dir: bool = True 25 | ) -> List[str]: 26 | """Find files recursively. 27 | 28 | Args: 29 | root_dir (str): Root root_dir to find. 30 | query (List[str]): Query to find. 31 | include_root_dir (bool): If False, root_dir name is not included. 32 | 33 | Returns: 34 | List[str]: List of found filenames. 
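    Example (illustrative; assumes ./wavs holds a.wav and sub/b.flac):
        >>> find_files("./wavs")  # doctest: +SKIP
        ['./wavs/a.wav', './wavs/sub/b.flac']
        >>> find_files("./wavs", include_root_dir=False)  # doctest: +SKIP
        ['a.wav', 'sub/b.flac']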
35 | 36 | """ 37 | files = [] 38 | for root, dirnames, filenames in os.walk(root_dir, followlinks=True): 39 | for q in query: 40 | for filename in fnmatch.filter(filenames, q): 41 | files.append(os.path.join(root, filename)) 42 | if not include_root_dir: 43 | files = [file_.replace(root_dir + "/", "") for file_ in files] 44 | 45 | return files 46 | 47 | 48 | def sptk_extract( 49 | x: np.ndarray, 50 | fs: int, 51 | n_fft: int = 512, 52 | n_shift: int = 256, 53 | mcep_dim: int = 25, 54 | mcep_alpha: float = 0.41, 55 | is_padding: bool = False, 56 | ) -> np.ndarray: 57 | """Extract SPTK-based mel-cepstrum. 58 | 59 | Args: 60 | x (ndarray): 1D waveform array. 61 | fs (int): Sampling rate 62 | n_fft (int): FFT length in point (default=512). 63 | n_shift (int): Shift length in point (default=256). 64 | mcep_dim (int): Dimension of mel-cepstrum (default=25). 65 | mcep_alpha (float): All pass filter coefficient (default=0.41). 66 | is_padding (bool): Whether to pad the end of signal (default=False). 67 | 68 | Returns: 69 | ndarray: Mel-cepstrum with the size (N, n_fft). 70 | 71 | """ 72 | # perform padding 73 | if is_padding: 74 | n_pad = n_fft - (len(x) - n_fft) % n_shift 75 | x = np.pad(x, (0, n_pad), "reflect") 76 | 77 | # get number of frames 78 | n_frame = (len(x) - n_fft) // n_shift + 1 79 | 80 | # get window function 81 | win = pysptk.sptk.hamming(n_fft) 82 | 83 | # check mcep and alpha 84 | if mcep_dim is None or mcep_alpha is None: 85 | mcep_dim, mcep_alpha = _get_best_mcep_params(fs) 86 | 87 | # calculate spectrogram 88 | mcep = [ 89 | pysptk.mcep( 90 | x[n_shift * i : n_shift * i + n_fft] * win, 91 | mcep_dim, 92 | mcep_alpha, 93 | eps=1e-6, 94 | etype=1, 95 | ) 96 | for i in range(n_frame) 97 | ] 98 | 99 | return np.stack(mcep) 100 | 101 | 102 | def _get_basename(path: str) -> str: 103 | return os.path.splitext(os.path.split(path)[-1])[0] 104 | 105 | 106 | def _get_best_mcep_params(fs: int) -> Tuple[int, float]: 107 | if fs == 16000: 108 | return 23, 0.42 109 | elif fs == 22050: 110 | return 34, 0.45 111 | elif fs == 24000: 112 | return 34, 0.46 113 | elif fs == 44100: 114 | return 39, 0.53 115 | elif fs == 48000: 116 | return 39, 0.55 117 | else: 118 | raise ValueError(f"Not found the setting for {fs}.") 119 | 120 | 121 | def calculate( 122 | file_list: List[str], 123 | gt_file_list: List[str], 124 | args: argparse.Namespace, 125 | mcd_dict: Dict, 126 | ): 127 | """Calculate MCD.""" 128 | for i, gen_path in enumerate(file_list): 129 | corresponding_list = list( 130 | filter( 131 | lambda gt_path: _get_basename(gt_path)[:-7] in gen_path, gt_file_list 132 | ) 133 | ) 134 | print("corresponding_list", corresponding_list) 135 | assert len(corresponding_list) == 1 136 | gt_path = corresponding_list[0] 137 | gt_basename = _get_basename(gt_path) 138 | 139 | # load wav file as int16 140 | gen_x, gen_fs = sf.read(gen_path, dtype="int16") 141 | gt_x, gt_fs = sf.read(gt_path, dtype="int16") 142 | 143 | fs = gen_fs 144 | if gen_fs != gt_fs: 145 | gt_x = librosa.resample(gt_x.astype(np.float), gt_fs, gen_fs) 146 | 147 | # extract ground truth and converted features 148 | gen_mcep = sptk_extract( 149 | x=gen_x, 150 | fs=fs, 151 | n_fft=args.n_fft, 152 | n_shift=args.n_shift, 153 | mcep_dim=args.mcep_dim, 154 | mcep_alpha=args.mcep_alpha, 155 | ) 156 | gt_mcep = sptk_extract( 157 | x=gt_x, 158 | fs=fs, 159 | n_fft=args.n_fft, 160 | n_shift=args.n_shift, 161 | mcep_dim=args.mcep_dim, 162 | mcep_alpha=args.mcep_alpha, 163 | ) 164 | 165 | # DTW 166 | _, path = fastdtw(gen_mcep, gt_mcep, 
dist=spatial.distance.euclidean) 167 | twf = np.array(path).T 168 | gen_mcep_dtw = gen_mcep[twf[0]] 169 | gt_mcep_dtw = gt_mcep[twf[1]] 170 | 171 | # MCD 172 | diff2sum = np.sum((gen_mcep_dtw - gt_mcep_dtw) ** 2, 1) 173 | mcd = np.mean(10.0 / np.log(10.0) * np.sqrt(2 * diff2sum), 0) 174 | logging.info(f"{gt_basename} {mcd:.4f}") 175 | mcd_dict[gt_basename] = mcd 176 | 177 | 178 | def get_parser() -> argparse.Namespace: 179 | """Get argument parser.""" 180 | parser = argparse.ArgumentParser(description="Evaluate Mel-cepstrum distortion.") 181 | parser.add_argument( 182 | "gen_wavdir_or_wavscp", 183 | type=str, 184 | help="Path of directory or wav.scp for generated waveforms.", 185 | ) 186 | parser.add_argument( 187 | "gt_wavdir_or_wavscp", 188 | type=str, 189 | help="Path of directory or wav.scp for ground truth waveforms.", 190 | ) 191 | parser.add_argument( 192 | "--outdir", 193 | type=str, 194 | help="Path of directory to write the results.", 195 | ) 196 | 197 | # analysis related 198 | parser.add_argument( 199 | "--mcep_dim", 200 | default=None, 201 | type=int, 202 | help=( 203 | "Dimension of mel cepstrum coefficients. " 204 | "If None, automatically set to the best dimension for the sampling." 205 | ), 206 | ) 207 | parser.add_argument( 208 | "--mcep_alpha", 209 | default=None, 210 | type=float, 211 | help=( 212 | "All pass constant for mel-cepstrum analysis. " 213 | "If None, automatically set to the best dimension for the sampling." 214 | ), 215 | ) 216 | parser.add_argument( 217 | "--n_fft", 218 | default=1024, 219 | type=int, 220 | help="The number of FFT points.", 221 | ) 222 | parser.add_argument( 223 | "--n_shift", 224 | default=256, 225 | type=int, 226 | help="The number of shift points.", 227 | ) 228 | parser.add_argument( 229 | "--nj", 230 | default=16, 231 | type=int, 232 | help="Number of parallel jobs.", 233 | ) 234 | parser.add_argument( 235 | "--verbose", 236 | default=1, 237 | type=int, 238 | help="Verbosity level. 
Higher is more logging.", 239 | ) 240 | return parser 241 | 242 | 243 | def main(): 244 | """Run MCD calculation in parallel.""" 245 | args = get_parser().parse_args() 246 | 247 | # logging info 248 | if args.verbose > 1: 249 | logging.basicConfig( 250 | level=logging.DEBUG, 251 | format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s", 252 | ) 253 | elif args.verbose > 0: 254 | logging.basicConfig( 255 | level=logging.INFO, 256 | format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s", 257 | ) 258 | else: 259 | logging.basicConfig( 260 | level=logging.WARN, 261 | format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s", 262 | ) 263 | logging.warning("Skip DEBUG/INFO messages") 264 | 265 | # find files 266 | if os.path.isdir(args.gen_wavdir_or_wavscp): 267 | gen_files = sorted(find_files(args.gen_wavdir_or_wavscp)) 268 | else: 269 | with open(args.gen_wavdir_or_wavscp) as f: 270 | gen_files = [line.strip().split(None, 1)[1] for line in f.readlines()] 271 | if gen_files[0].endswith("|"): 272 | raise ValueError("The piped wav.scp format is not supported.") 273 | if os.path.isdir(args.gt_wavdir_or_wavscp): 274 | gt_files = sorted(find_files(args.gt_wavdir_or_wavscp)) 275 | else: 276 | with open(args.gt_wavdir_or_wavscp) as f: 277 | gt_files = [line.strip().split(None, 1)[1] for line in f.readlines()] 278 | if gt_files[0].endswith("|"): 279 | raise ValueError("The piped wav.scp format is not supported.") 280 | 281 | # Get and divide list 282 | if len(gen_files) == 0: 283 | raise FileNotFoundError("No generated audio files found.") 284 | if len(gen_files) > len(gt_files): 285 | raise ValueError( 286 | "There are fewer groundtruth files than generated files " 287 | f"(#gen={len(gen_files)} vs. #gt={len(gt_files)}). " 288 | "Please check the groundtruth directory."
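            # Illustrative check of the MCD formula in calculate() above (toy
            # value): a frame whose summed squared mcep difference is 0.5 gives
            # 10 / ln(10) * sqrt(2 * 0.5) = 4.3429 * 1.0 ≈ 4.34 dB; the reported
            # score is the mean over all DTW-aligned frames.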
289 | ) 290 | logging.info("The number of utterances = %d" % len(gen_files)) 291 | file_lists = np.array_split(gen_files, args.nj) 292 | file_lists = [f_list.tolist() for f_list in file_lists] 293 | 294 | # multi processing 295 | with mp.Manager() as manager: 296 | mcd_dict = manager.dict() 297 | processes = [] 298 | for f in file_lists: 299 | p = mp.Process(target=calculate, args=(f, gt_files, args, mcd_dict)) 300 | p.start() 301 | processes.append(p) 302 | 303 | # wait for all process 304 | for p in processes: 305 | p.join() 306 | 307 | # convert to standard list 308 | mcd_dict = dict(mcd_dict) 309 | 310 | # calculate statistics 311 | mean_mcd = np.mean(np.array([v for v in mcd_dict.values()])) 312 | std_mcd = np.std(np.array([v for v in mcd_dict.values()])) 313 | logging.info(f"Average: {mean_mcd:.4f} ± {std_mcd:.4f}") 314 | 315 | # write results 316 | if args.outdir is None: 317 | if os.path.isdir(args.gen_wavdir_or_wavscp): 318 | args.outdir = args.gen_wavdir_or_wavscp 319 | else: 320 | args.outdir = os.path.dirname(args.gen_wavdir_or_wavscp) 321 | os.makedirs(args.outdir, exist_ok=True) 322 | with open(f"{args.outdir}/utt2mcd", "w") as f: 323 | for utt_id in sorted(mcd_dict.keys()): 324 | mcd = mcd_dict[utt_id] 325 | f.write(f"{utt_id} {mcd:.4f}\n") 326 | with open(f"{args.outdir}/mcd_avg_result.txt", "w") as f: 327 | f.write(f"#utterances: {len(gen_files)}\n") 328 | f.write(f"Average: {mean_mcd:.4f} ± {std_mcd:.4f}") 329 | 330 | logging.info("Successfully finished MCD evaluation.") 331 | 332 | 333 | if __name__ == "__main__": 334 | main() 335 | -------------------------------------------------------------------------------- /evaluate/evaluate_semitone.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2021 Wen-Chin Huang and Tomoki Hayashi 4 | # Copyright 2022 Shuai Guo 5 | # Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0) 6 | 7 | """Evaluate semitone ACC between generated and groundtruth audios based on World.""" 8 | 9 | import argparse 10 | import fnmatch 11 | import logging 12 | import multiprocessing as mp 13 | import os 14 | from math import log2, pow 15 | from typing import Dict, List, Tuple 16 | 17 | import librosa 18 | import numpy as np 19 | import pysptk 20 | import pyworld as pw 21 | import soundfile as sf 22 | from fastdtw import fastdtw 23 | from scipy import spatial 24 | 25 | 26 | def _Hz2Semitone(freq): 27 | """_Hz2Semitone.""" 28 | A4 = 440 29 | C0 = A4 * pow(2, -4.75) 30 | name = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"] 31 | 32 | if freq == 0: 33 | return "Sil" # silence 34 | else: 35 | h = round(12 * log2(freq / C0)) 36 | octave = h // 12 37 | n = h % 12 38 | return name[n] + "_" + str(octave) 39 | 40 | 41 | def find_files( 42 | root_dir: str, query: List[str] = ["*.flac", "*.wav"], include_root_dir: bool = True 43 | ) -> List[str]: 44 | """Find files recursively. 45 | 46 | Args: 47 | root_dir (str): Root root_dir to find. 48 | query (List[str]): Query to find. 49 | include_root_dir (bool): If False, root_dir name is not included. 50 | 51 | Returns: 52 | List[str]: List of found filenames. 
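    Example (illustrative; assumes dump/ holds utt1.flac and utt2.wav and
    only flac files should be collected):
        >>> find_files("dump", query=["*.flac"])  # doctest: +SKIP
        ['dump/utt1.flac']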
53 | 54 | """ 55 | files = [] 56 | for root, dirnames, filenames in os.walk(root_dir, followlinks=True): 57 | for q in query: 58 | for filename in fnmatch.filter(filenames, q): 59 | files.append(os.path.join(root, filename)) 60 | if not include_root_dir: 61 | files = [file_.replace(root_dir + "/", "") for file_ in files] 62 | 63 | return files 64 | 65 | 66 | def world_extract( 67 | x: np.ndarray, 68 | fs: int, 69 | f0min: int = 40, 70 | f0max: int = 800, 71 | n_fft: int = 512, 72 | n_shift: int = 256, 73 | mcep_dim: int = 25, 74 | mcep_alpha: float = 0.41, 75 | ) -> np.ndarray: 76 | """Extract World-based acoustic features. 77 | 78 | Args: 79 | x (ndarray): 1D waveform array. 80 | fs (int): Minimum f0 value (default=40). 81 | f0 (int): Maximum f0 value (default=800). 82 | n_shift (int): Shift length in point (default=256). 83 | n_fft (int): FFT length in point (default=512). 84 | n_shift (int): Shift length in point (default=256). 85 | mcep_dim (int): Dimension of mel-cepstrum (default=25). 86 | mcep_alpha (float): All pass filter coefficient (default=0.41). 87 | 88 | Returns: 89 | ndarray: Mel-cepstrum with the size (N, n_fft). 90 | ndarray: F0 sequence (N,). 91 | 92 | """ 93 | # extract features 94 | x = x.astype(np.float64) 95 | f0, time_axis = pw.harvest( 96 | x, 97 | fs, 98 | f0_floor=f0min, 99 | f0_ceil=f0max, 100 | frame_period=n_shift / fs * 1000, 101 | ) 102 | sp = pw.cheaptrick(x, f0, time_axis, fs, fft_size=n_fft) 103 | if mcep_dim is None or mcep_alpha is None: 104 | mcep_dim, mcep_alpha = _get_best_mcep_params(fs) 105 | mcep = pysptk.sp2mc(sp, mcep_dim, mcep_alpha) 106 | 107 | return mcep, f0 108 | 109 | 110 | def _get_basename(path: str) -> str: 111 | return os.path.splitext(os.path.split(path)[-1])[0] 112 | 113 | 114 | def _get_best_mcep_params(fs: int) -> Tuple[int, float]: 115 | if fs == 16000: 116 | return 23, 0.42 117 | elif fs == 22050: 118 | return 34, 0.45 119 | elif fs == 24000: 120 | return 34, 0.46 121 | elif fs == 44100: 122 | return 39, 0.53 123 | elif fs == 48000: 124 | return 39, 0.55 125 | else: 126 | raise ValueError(f"Not found the setting for {fs}.") 127 | 128 | 129 | def calculate( 130 | file_list: List[str], 131 | gt_file_list: List[str], 132 | args: argparse.Namespace, 133 | semitone_acc_dict: Dict[str, float], 134 | ): 135 | """Calculate semitone ACC.""" 136 | for i, gen_path in enumerate(file_list): 137 | corresponding_list = list( 138 | filter( 139 | lambda gt_path: _get_basename(gt_path)[:-7] in gen_path, gt_file_list 140 | ) 141 | ) 142 | assert len(corresponding_list) == 1 143 | gt_path = corresponding_list[0] 144 | gt_basename = _get_basename(gt_path) 145 | 146 | # load wav file as int16 147 | gen_x, gen_fs = sf.read(gen_path, dtype="int16") 148 | gt_x, gt_fs = sf.read(gt_path, dtype="int16") 149 | 150 | fs = gen_fs 151 | if gen_fs != gt_fs: 152 | gt_x = librosa.resample(gt_x.astype(np.float), gt_fs, gen_fs) 153 | 154 | # extract ground truth and converted features 155 | gen_mcep, gen_f0 = world_extract( 156 | x=gen_x, 157 | fs=fs, 158 | f0min=args.f0min, 159 | f0max=args.f0max, 160 | n_fft=args.n_fft, 161 | n_shift=args.n_shift, 162 | mcep_dim=args.mcep_dim, 163 | mcep_alpha=args.mcep_alpha, 164 | ) 165 | gt_mcep, gt_f0 = world_extract( 166 | x=gt_x, 167 | fs=fs, 168 | f0min=args.f0min, 169 | f0max=args.f0max, 170 | n_fft=args.n_fft, 171 | n_shift=args.n_shift, 172 | mcep_dim=args.mcep_dim, 173 | mcep_alpha=args.mcep_alpha, 174 | ) 175 | 176 | # DTW 177 | _, path = fastdtw(gen_mcep, gt_mcep, dist=spatial.distance.euclidean) 178 | twf = 
np.array(path).T 179 | gen_f0_dtw = gen_f0[twf[0]] 180 | gt_f0_dtw = gt_f0[twf[1]] 181 | 182 | # Semitone ACC 183 | semitone_GT = np.array([_Hz2Semitone(_f0) for _f0 in gt_f0_dtw]) 184 | semitone_predict = np.array([_Hz2Semitone(_f0) for _f0 in gen_f0_dtw]) 185 | semitone_ACC = float((semitone_GT == semitone_predict).sum()) / len(semitone_GT) 186 | semitone_acc_dict[gt_basename] = semitone_ACC 187 | 188 | 189 | def get_parser() -> argparse.Namespace: 190 | """Get argument parser.""" 191 | parser = argparse.ArgumentParser(description="Evaluate Mel-cepstrum distortion.") 192 | parser.add_argument( 193 | "gen_wavdir_or_wavscp", 194 | type=str, 195 | help="Path of directory or wav.scp for generated waveforms.", 196 | ) 197 | parser.add_argument( 198 | "gt_wavdir_or_wavscp", 199 | type=str, 200 | help="Path of directory or wav.scp for ground truth waveforms.", 201 | ) 202 | parser.add_argument( 203 | "--outdir", 204 | type=str, 205 | help="Path of directory to write the results.", 206 | ) 207 | 208 | # analysis related 209 | parser.add_argument( 210 | "--mcep_dim", 211 | default=None, 212 | type=int, 213 | help=( 214 | "Dimension of mel cepstrum coefficients. " 215 | "If None, automatically set to the best dimension for the sampling." 216 | ), 217 | ) 218 | parser.add_argument( 219 | "--mcep_alpha", 220 | default=None, 221 | type=float, 222 | help=( 223 | "All pass constant for mel-cepstrum analysis. " 224 | "If None, automatically set to the best dimension for the sampling." 225 | ), 226 | ) 227 | parser.add_argument( 228 | "--n_fft", 229 | default=1024, 230 | type=int, 231 | help="The number of FFT points.", 232 | ) 233 | parser.add_argument( 234 | "--n_shift", 235 | default=256, 236 | type=int, 237 | help="The number of shift points.", 238 | ) 239 | parser.add_argument( 240 | "--f0min", 241 | default=40, 242 | type=int, 243 | help="Minimum f0 value.", 244 | ) 245 | parser.add_argument( 246 | "--f0max", 247 | default=800, 248 | type=int, 249 | help="Maximum f0 value.", 250 | ) 251 | parser.add_argument( 252 | "--nj", 253 | default=16, 254 | type=int, 255 | help="Number of parallel jobs.", 256 | ) 257 | parser.add_argument( 258 | "--verbose", 259 | default=1, 260 | type=int, 261 | help="Verbosity level. 
Higher is more logging.", 262 | ) 263 | return parser 264 | 265 | 266 | def main(): 267 | """Run semitone ACC calculation in parallel.""" 268 | args = get_parser().parse_args() 269 | 270 | # logging info 271 | if args.verbose > 1: 272 | logging.basicConfig( 273 | level=logging.DEBUG, 274 | format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s", 275 | ) 276 | elif args.verbose > 0: 277 | logging.basicConfig( 278 | level=logging.INFO, 279 | format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s", 280 | ) 281 | else: 282 | logging.basicConfig( 283 | level=logging.WARN, 284 | format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s", 285 | ) 286 | logging.warning("Skip DEBUG/INFO messages") 287 | 288 | # find files 289 | if os.path.isdir(args.gen_wavdir_or_wavscp): 290 | gen_files = sorted(find_files(args.gen_wavdir_or_wavscp)) 291 | else: 292 | with open(args.gen_wavdir_or_wavscp) as f: 293 | gen_files = [line.strip().split(None, 1)[1] for line in f.readlines()] 294 | if gen_files[0].endswith("|"): 295 | raise ValueError("Not supported wav.scp format.") 296 | if os.path.isdir(args.gt_wavdir_or_wavscp): 297 | gt_files = sorted(find_files(args.gt_wavdir_or_wavscp)) 298 | else: 299 | with open(args.gt_wavdir_or_wavscp) as f: 300 | gt_files = [line.strip().split(None, 1)[1] for line in f.readlines()] 301 | if gt_files[0].endswith("|"): 302 | raise ValueError("Not supported wav.scp format.") 303 | 304 | # Get and divide list 305 | if len(gen_files) == 0: 306 | raise FileNotFoundError("Not found any generated audio files.") 307 | if len(gen_files) > len(gt_files): 308 | raise ValueError( 309 | "#groundtruth files are less than #generated files " 310 | f"(#gen={len(gen_files)} vs. #gt={len(gt_files)}). " 311 | "Please check the groundtruth directory." 
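            # Illustrative check of _Hz2Semitone() above (assuming A4 = 440 Hz):
            # C0 = 440 * 2**-4.75 ≈ 16.352 Hz, so 440.0 Hz maps to "A_4" and
            # 261.6 Hz maps to "C_4"; frames with f0 == 0 map to "Sil" and
            # still count toward the accuracy denominator.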
312 | ) 313 | logging.info("The number of utterances = %d" % len(gen_files)) 314 | file_lists = np.array_split(gen_files, args.nj) 315 | file_lists = [f_list.tolist() for f_list in file_lists] 316 | 317 | # multi processing 318 | with mp.Manager() as manager: 319 | semitone_acc_dict = manager.dict() 320 | processes = [] 321 | for f in file_lists: 322 | p = mp.Process( 323 | target=calculate, args=(f, gt_files, args, semitone_acc_dict) 324 | ) 325 | p.start() 326 | processes.append(p) 327 | 328 | # wait for all process 329 | for p in processes: 330 | p.join() 331 | 332 | # convert to standard list 333 | semitone_acc_dict = dict(semitone_acc_dict) 334 | 335 | # calculate statistics 336 | mean_semitone_acc = np.mean(np.array([v for v in semitone_acc_dict.values()])) 337 | logging.info(f"Average - Semitone_ACC: {mean_semitone_acc*100:.2f}%") 338 | 339 | # write results 340 | if args.outdir is None: 341 | if os.path.isdir(args.gen_wavdir_or_wavscp): 342 | args.outdir = args.gen_wavdir_or_wavscp 343 | else: 344 | args.outdir = os.path.dirname(args.gen_wavdir_or_wavscp) 345 | os.makedirs(args.outdir, exist_ok=True) 346 | with open(f"{args.outdir}/utt2semitone_acc", "w") as f: 347 | for utt_id in sorted(semitone_acc_dict.keys()): 348 | semitone_ACC = semitone_acc_dict[utt_id] 349 | f.write(f"{utt_id} {semitone_ACC*100:.2f}%\n") 350 | with open(f"{args.outdir}/semitone_acc_avg_result.txt", "w") as f: 351 | f.write(f"#utterances: {len(gen_files)}\n") 352 | f.write(f"Average: {mean_semitone_acc*100:.2f}%") 353 | 354 | logging.info("Successfully finished semitone ACC evaluation.") 355 | 356 | 357 | if __name__ == "__main__": 358 | main() 359 | -------------------------------------------------------------------------------- /evaluate/evaluate_vuv.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2021 Wen-Chin Huang and Tomoki Hayashi 4 | # Copyright 2022 Shuai Guo 5 | # Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0) 6 | 7 | """Evaluate VUV error between generated and groundtruth audios based on World.""" 8 | 9 | import argparse 10 | import fnmatch 11 | import logging 12 | import multiprocessing as mp 13 | import os 14 | from typing import Dict, List, Tuple 15 | 16 | import librosa 17 | import numpy as np 18 | import pysptk 19 | import pyworld as pw 20 | import soundfile as sf 21 | from fastdtw import fastdtw 22 | from scipy import spatial 23 | 24 | 25 | def _Hz2Flag(freq): 26 | if freq == 0: 27 | return False 28 | else: 29 | return True 30 | 31 | 32 | def find_files( 33 | root_dir: str, query: List[str] = ["*.flac", "*.wav"], include_root_dir: bool = True 34 | ) -> List[str]: 35 | """Find files recursively. 36 | 37 | Args: 38 | root_dir (str): Root root_dir to find. 39 | query (List[str]): Query to find. 40 | include_root_dir (bool): If False, root_dir name is not included. 41 | 42 | Returns: 43 | List[str]: List of found filenames. 
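    Note:
        Directories are walked with followlinks=True, so symlinked
        subdirectories (e.g. a linked wav dump) are searched as well.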
44 | 45 | """ 46 | files = [] 47 | for root, dirnames, filenames in os.walk(root_dir, followlinks=True): 48 | for q in query: 49 | for filename in fnmatch.filter(filenames, q): 50 | files.append(os.path.join(root, filename)) 51 | if not include_root_dir: 52 | files = [file_.replace(root_dir + "/", "") for file_ in files] 53 | 54 | return files 55 | 56 | 57 | def world_extract( 58 | x: np.ndarray, 59 | fs: int, 60 | f0min: int = 40, 61 | f0max: int = 800, 62 | n_fft: int = 512, 63 | n_shift: int = 256, 64 | mcep_dim: int = 25, 65 | mcep_alpha: float = 0.41, 66 | ) -> np.ndarray: 67 | """Extract World-based acoustic features. 68 | 69 | Args: 70 | x (ndarray): 1D waveform array. 71 | fs (int): Minimum f0 value (default=40). 72 | f0 (int): Maximum f0 value (default=800). 73 | n_shift (int): Shift length in point (default=256). 74 | n_fft (int): FFT length in point (default=512). 75 | n_shift (int): Shift length in point (default=256). 76 | mcep_dim (int): Dimension of mel-cepstrum (default=25). 77 | mcep_alpha (float): All pass filter coefficient (default=0.41). 78 | 79 | Returns: 80 | ndarray: Mel-cepstrum with the size (N, n_fft). 81 | ndarray: F0 sequence (N,). 82 | 83 | """ 84 | # extract features 85 | x = x.astype(np.float64) 86 | f0, time_axis = pw.harvest( 87 | x, 88 | fs, 89 | f0_floor=f0min, 90 | f0_ceil=f0max, 91 | frame_period=n_shift / fs * 1000, 92 | ) 93 | sp = pw.cheaptrick(x, f0, time_axis, fs, fft_size=n_fft) 94 | if mcep_dim is None or mcep_alpha is None: 95 | mcep_dim, mcep_alpha = _get_best_mcep_params(fs) 96 | mcep = pysptk.sp2mc(sp, mcep_dim, mcep_alpha) 97 | 98 | return mcep, f0 99 | 100 | 101 | def _get_basename(path: str) -> str: 102 | return os.path.splitext(os.path.split(path)[-1])[0] 103 | 104 | 105 | def _get_best_mcep_params(fs: int) -> Tuple[int, float]: 106 | if fs == 16000: 107 | return 23, 0.42 108 | elif fs == 22050: 109 | return 34, 0.45 110 | elif fs == 24000: 111 | return 34, 0.46 112 | elif fs == 44100: 113 | return 39, 0.53 114 | elif fs == 48000: 115 | return 39, 0.55 116 | else: 117 | raise ValueError(f"Not found the setting for {fs}.") 118 | 119 | 120 | def calculate( 121 | file_list: List[str], 122 | gt_file_list: List[str], 123 | args: argparse.Namespace, 124 | vuv_err_dict: Dict[str, float], 125 | ): 126 | """Calculate VUV error.""" 127 | for i, gen_path in enumerate(file_list): 128 | corresponding_list = list( 129 | filter( 130 | lambda gt_path: _get_basename(gt_path)[:-7] in gen_path, gt_file_list 131 | ) 132 | ) 133 | assert len(corresponding_list) == 1 134 | gt_path = corresponding_list[0] 135 | gt_basename = _get_basename(gt_path) 136 | 137 | # load wav file as int16 138 | gen_x, gen_fs = sf.read(gen_path, dtype="int16") 139 | gt_x, gt_fs = sf.read(gt_path, dtype="int16") 140 | 141 | fs = gen_fs 142 | if gen_fs != gt_fs: 143 | gt_x = librosa.resample(gt_x.astype(np.float), gt_fs, gen_fs) 144 | 145 | # extract ground truth and converted features 146 | gen_mcep, gen_f0 = world_extract( 147 | x=gen_x, 148 | fs=fs, 149 | f0min=args.f0min, 150 | f0max=args.f0max, 151 | n_fft=args.n_fft, 152 | n_shift=args.n_shift, 153 | mcep_dim=args.mcep_dim, 154 | mcep_alpha=args.mcep_alpha, 155 | ) 156 | gt_mcep, gt_f0 = world_extract( 157 | x=gt_x, 158 | fs=fs, 159 | f0min=args.f0min, 160 | f0max=args.f0max, 161 | n_fft=args.n_fft, 162 | n_shift=args.n_shift, 163 | mcep_dim=args.mcep_dim, 164 | mcep_alpha=args.mcep_alpha, 165 | ) 166 | 167 | # DTW 168 | _, path = fastdtw(gen_mcep, gt_mcep, dist=spatial.distance.euclidean) 169 | twf = np.array(path).T 170 | 
gen_f0_dtw = gen_f0[twf[0]] 171 | gt_f0_dtw = gt_f0[twf[1]] 172 | 173 | # VUV ERR 174 | vuv_GT = np.array([_Hz2Flag(_f0) for _f0 in gt_f0_dtw]) 175 | vuv_predict = np.array([_Hz2Flag(_f0) for _f0 in gen_f0_dtw]) 176 | vuv_ERR = float((vuv_GT != vuv_predict).sum()) / len(vuv_GT) 177 | vuv_err_dict[gt_basename] = vuv_ERR 178 | 179 | 180 | def get_parser() -> argparse.Namespace: 181 | """Get argument parser.""" 182 | parser = argparse.ArgumentParser(description="Evaluate Mel-cepstrum distortion.") 183 | parser.add_argument( 184 | "gen_wavdir_or_wavscp", 185 | type=str, 186 | help="Path of directory or wav.scp for generated waveforms.", 187 | ) 188 | parser.add_argument( 189 | "gt_wavdir_or_wavscp", 190 | type=str, 191 | help="Path of directory or wav.scp for ground truth waveforms.", 192 | ) 193 | parser.add_argument( 194 | "--outdir", 195 | type=str, 196 | help="Path of directory to write the results.", 197 | ) 198 | 199 | # analysis related 200 | parser.add_argument( 201 | "--mcep_dim", 202 | default=None, 203 | type=int, 204 | help=( 205 | "Dimension of mel cepstrum coefficients. " 206 | "If None, automatically set to the best dimension for the sampling." 207 | ), 208 | ) 209 | parser.add_argument( 210 | "--mcep_alpha", 211 | default=None, 212 | type=float, 213 | help=( 214 | "All pass constant for mel-cepstrum analysis. " 215 | "If None, automatically set to the best dimension for the sampling." 216 | ), 217 | ) 218 | parser.add_argument( 219 | "--n_fft", 220 | default=1024, 221 | type=int, 222 | help="The number of FFT points.", 223 | ) 224 | parser.add_argument( 225 | "--n_shift", 226 | default=256, 227 | type=int, 228 | help="The number of shift points.", 229 | ) 230 | parser.add_argument( 231 | "--f0min", 232 | default=40, 233 | type=int, 234 | help="Minimum f0 value.", 235 | ) 236 | parser.add_argument( 237 | "--f0max", 238 | default=800, 239 | type=int, 240 | help="Maximum f0 value.", 241 | ) 242 | parser.add_argument( 243 | "--nj", 244 | default=16, 245 | type=int, 246 | help="Number of parallel jobs.", 247 | ) 248 | parser.add_argument( 249 | "--verbose", 250 | default=1, 251 | type=int, 252 | help="Verbosity level. 
Higher is more logging.", 253 | ) 254 | return parser 255 | 256 | 257 | def main(): 258 | """Run VUV error calculation in parallel.""" 259 | args = get_parser().parse_args() 260 | 261 | # logging info 262 | if args.verbose > 1: 263 | logging.basicConfig( 264 | level=logging.DEBUG, 265 | format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s", 266 | ) 267 | elif args.verbose > 0: 268 | logging.basicConfig( 269 | level=logging.INFO, 270 | format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s", 271 | ) 272 | else: 273 | logging.basicConfig( 274 | level=logging.WARN, 275 | format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s", 276 | ) 277 | logging.warning("Skip DEBUG/INFO messages") 278 | 279 | # find files 280 | if os.path.isdir(args.gen_wavdir_or_wavscp): 281 | gen_files = sorted(find_files(args.gen_wavdir_or_wavscp)) 282 | else: 283 | with open(args.gen_wavdir_or_wavscp) as f: 284 | gen_files = [line.strip().split(None, 1)[1] for line in f.readlines()] 285 | if gen_files[0].endswith("|"): 286 | raise ValueError("Not supported wav.scp format.") 287 | if os.path.isdir(args.gt_wavdir_or_wavscp): 288 | gt_files = sorted(find_files(args.gt_wavdir_or_wavscp)) 289 | else: 290 | with open(args.gt_wavdir_or_wavscp) as f: 291 | gt_files = [line.strip().split(None, 1)[1] for line in f.readlines()] 292 | if gt_files[0].endswith("|"): 293 | raise ValueError("Not supported wav.scp format.") 294 | 295 | # Get and divide list 296 | if len(gen_files) == 0: 297 | raise FileNotFoundError("Not found any generated audio files.") 298 | if len(gen_files) > len(gt_files): 299 | raise ValueError( 300 | "#groundtruth files are less than #generated files " 301 | f"(#gen={len(gen_files)} vs. #gt={len(gt_files)}). " 302 | "Please check the groundtruth directory." 
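            # Illustrative check of the VUV metric in calculate() above (toy f0
            # tracks over five aligned frames):
            #   gt : [0, 220, 230, 0, 240] -> [F, T, T, F, T]
            #   gen: [0,   0, 231, 0, 242] -> [F, F, T, F, T]
            # one mismatch out of five frames gives VUV_ERR = 20.00%.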
303 | ) 304 | logging.info("The number of utterances = %d" % len(gen_files)) 305 | file_lists = np.array_split(gen_files, args.nj) 306 | file_lists = [f_list.tolist() for f_list in file_lists] 307 | 308 | # multi processing 309 | with mp.Manager() as manager: 310 | vuv_err_dict = manager.dict() 311 | processes = [] 312 | for f in file_lists: 313 | p = mp.Process(target=calculate, args=(f, gt_files, args, vuv_err_dict)) 314 | p.start() 315 | processes.append(p) 316 | 317 | # wait for all process 318 | for p in processes: 319 | p.join() 320 | 321 | # convert to standard list 322 | vuv_err_dict = dict(vuv_err_dict) 323 | 324 | # calculate statistics 325 | mean_vuv_err = np.mean(np.array([v for v in vuv_err_dict.values()])) 326 | logging.info(f"Average - VUV_ERROR: {mean_vuv_err*100:.2f}%") 327 | 328 | # write results 329 | if args.outdir is None: 330 | if os.path.isdir(args.gen_wavdir_or_wavscp): 331 | args.outdir = args.gen_wavdir_or_wavscp 332 | else: 333 | args.outdir = os.path.dirname(args.gen_wavdir_or_wavscp) 334 | os.makedirs(args.outdir, exist_ok=True) 335 | with open(f"{args.outdir}/utt2vuv_error", "w") as f: 336 | for utt_id in sorted(vuv_err_dict.keys()): 337 | vuv_ERR = vuv_err_dict[utt_id] 338 | f.write(f"{utt_id} {vuv_ERR*100:.2f}%\n") 339 | with open(f"{args.outdir}/vuv_error_avg_result.txt", "w") as f: 340 | f.write(f"#utterances: {len(gen_files)}\n") 341 | f.write(f"Average: {mean_vuv_err*100:.2f}%") 342 | 343 | logging.info("Successfully finished VUV error evaluation.") 344 | 345 | 346 | if __name__ == "__main__": 347 | main() 348 | -------------------------------------------------------------------------------- /evaluate_score.sh: -------------------------------------------------------------------------------- 1 | echo "Generating" 2 | python vsinging_infer.py 3 | 4 | echo "Scoring" 5 | 6 | 7 | _gt_wavscp="singing_gt" 8 | _dir="evaluate" 9 | _gen_wavdir="singing_out" 10 | 11 | if [ ! 
-d "singing_gt" ] ; then 12 | echo "copy gt" 13 | mkdir -p "singing_gt" 14 | python normalize_wav.py 15 | fi 16 | 17 | # Objective Evaluation - MCD 18 | echo "Begin Scoring for MCD metrics on ${dset}, results are written under ${_dir}/MCD_res" 19 | 20 | mkdir -p "${_dir}/MCD_res" 21 | python evaluate/evaluate_mcd.py \ 22 | ${_gen_wavdir} \ 23 | ${_gt_wavscp} \ 24 | --outdir "${_gen_wavdir}/MCD_res" 25 | 26 | # Objective Evaluation - log-F0 RMSE 27 | echo "Begin Scoring for F0 related metrics on ${dset}, results are written under ${_dir}/F0_res" 28 | 29 | mkdir -p "${_dir}/F0_res" 30 | python evaluate/evaluate_f0.py \ 31 | ${_gen_wavdir} \ 32 | ${_gt_wavscp} \ 33 | --outdir "${_gen_wavdir}/F0_res" 34 | 35 | # Objective Evaluation - semitone ACC 36 | echo "Begin Scoring for SEMITONE related metrics on ${dset}, results are written under ${_dir}/SEMITONE_res" 37 | 38 | mkdir -p "${_dir}/SEMITONE_res" 39 | python evaluate/evaluate_semitone.py \ 40 | ${_gen_wavdir} \ 41 | ${_gt_wavscp} \ 42 | --outdir "${_gen_wavdir}/SEMITONE_res" 43 | 44 | # Objective Evaluation - VUV error 45 | echo "Begin Scoring for VUV related metrics on ${dset}, results are written under ${_dir}/VUV_res" 46 | 47 | mkdir -p "${_dir}/VUV_res" 48 | python evaluate/evaluate_vuv.py \ 49 | ${_gen_wavdir} \ 50 | ${_gt_wavscp} \ 51 | --outdir "${_gen_wavdir}/VUV_res" 52 | 53 | zip singing_out.zip singing_out/*.wav -------------------------------------------------------------------------------- /losses.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.nn import functional as F 3 | 4 | import commons 5 | 6 | 7 | def feature_loss(fmap_r, fmap_g): 8 | loss = 0 9 | for dr, dg in zip(fmap_r, fmap_g): 10 | for rl, gl in zip(dr, dg): 11 | rl = rl.float().detach() 12 | gl = gl.float() 13 | loss += torch.mean(torch.abs(rl - gl)) 14 | 15 | return loss * 2 16 | 17 | 18 | def discriminator_loss(disc_real_outputs, disc_generated_outputs): 19 | loss = 0 20 | r_losses = [] 21 | g_losses = [] 22 | for dr, dg in zip(disc_real_outputs, disc_generated_outputs): 23 | dr = dr.float() 24 | dg = dg.float() 25 | r_loss = torch.mean((1 - dr) ** 2) 26 | g_loss = torch.mean(dg**2) 27 | loss += r_loss + g_loss 28 | r_losses.append(r_loss.item()) 29 | g_losses.append(g_loss.item()) 30 | 31 | return loss, r_losses, g_losses 32 | 33 | 34 | def generator_loss(disc_outputs): 35 | loss = 0 36 | gen_losses = [] 37 | for dg in disc_outputs: 38 | dg = dg.float() 39 | l = torch.mean((1 - dg) ** 2) 40 | gen_losses.append(l) 41 | loss += l 42 | 43 | return loss, gen_losses 44 | 45 | 46 | def kl_loss(z_p, logs_q, m_p, logs_p, z_mask): 47 | """ 48 | z_p, logs_q: [b, h, t_t] 49 | m_p, logs_p: [b, h, t_t] 50 | """ 51 | z_p = z_p.float() 52 | logs_q = logs_q.float() 53 | m_p = m_p.float() 54 | logs_p = logs_p.float() 55 | z_mask = z_mask.float() 56 | 57 | kl = logs_p - logs_q - 0.5 58 | kl += 0.5 * ((z_p - m_p) ** 2) * torch.exp(-2.0 * logs_p) 59 | kl = torch.sum(kl * z_mask) 60 | l = kl / torch.sum(z_mask) 61 | return l 62 | -------------------------------------------------------------------------------- /mel_processing.py: -------------------------------------------------------------------------------- 1 | import math 2 | import os 3 | import random 4 | import torch 5 | from torch import nn 6 | import torch.nn.functional as F 7 | import torch.utils.data 8 | import numpy as np 9 | import librosa 10 | import librosa.util as librosa_util 11 | from librosa.util import normalize, pad_center, tiny 12 | from 
scipy.signal import get_window 13 | from scipy.io.wavfile import read 14 | from librosa.filters import mel as librosa_mel_fn 15 | 16 | MAX_WAV_VALUE = 32768.0 17 | 18 | 19 | def dynamic_range_compression_torch(x, C=1, clip_val=1e-5): 20 | """ 21 | PARAMS 22 | ------ 23 | C: compression factor 24 | """ 25 | return torch.log(torch.clamp(x, min=clip_val) * C) 26 | 27 | 28 | def dynamic_range_decompression_torch(x, C=1): 29 | """ 30 | PARAMS 31 | ------ 32 | C: compression factor used to compress 33 | """ 34 | return torch.exp(x) / C 35 | 36 | 37 | def spectral_normalize_torch(magnitudes): 38 | output = dynamic_range_compression_torch(magnitudes) 39 | return output 40 | 41 | 42 | def spectral_de_normalize_torch(magnitudes): 43 | output = dynamic_range_decompression_torch(magnitudes) 44 | return output 45 | 46 | 47 | mel_basis = {} 48 | hann_window = {} 49 | 50 | 51 | def spectrogram_torch(y, n_fft, sampling_rate, hop_size, win_size, center=False): 52 | if torch.min(y) < -1.0: 53 | print("min value is ", torch.min(y)) 54 | if torch.max(y) > 1.0: 55 | print("max value is ", torch.max(y)) 56 | 57 | global hann_window 58 | dtype_device = str(y.dtype) + "_" + str(y.device) 59 | wnsize_dtype_device = str(win_size) + "_" + dtype_device 60 | if wnsize_dtype_device not in hann_window: 61 | hann_window[wnsize_dtype_device] = torch.hann_window(win_size).to( 62 | dtype=y.dtype, device=y.device 63 | ) 64 | 65 | y = torch.nn.functional.pad( 66 | y.unsqueeze(1), 67 | (int((n_fft - hop_size) / 2), int((n_fft - hop_size) / 2)), 68 | mode="reflect", 69 | ) 70 | y = y.squeeze(1) 71 | 72 | spec = torch.stft( 73 | y, 74 | n_fft, 75 | hop_length=hop_size, 76 | win_length=win_size, 77 | window=hann_window[wnsize_dtype_device], 78 | center=center, 79 | pad_mode="reflect", 80 | normalized=False, 81 | onesided=True, 82 | return_complex=True, 83 | ) 84 | spec = torch.view_as_real(spec) 85 | spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6) 86 | return spec 87 | 88 | 89 | def spec_to_mel_torch(spec, n_fft, num_mels, sampling_rate, fmin, fmax): 90 | global mel_basis 91 | dtype_device = str(spec.dtype) + "_" + str(spec.device) 92 | fmax_dtype_device = str(fmax) + "_" + dtype_device 93 | if fmax_dtype_device not in mel_basis: 94 | mel = librosa_mel_fn(sr=sampling_rate, n_fft=n_fft, n_mels=num_mels, fmin=fmin, fmax=fmax) 95 | mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to( 96 | dtype=spec.dtype, device=spec.device 97 | ) 98 | spec = torch.matmul(mel_basis[fmax_dtype_device], spec) 99 | spec = spectral_normalize_torch(spec) 100 | return spec 101 | 102 | 103 | def mel_spectrogram_torch( 104 | y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False 105 | ): 106 | if torch.min(y) < -1.0: 107 | print("min value is ", torch.min(y)) 108 | if torch.max(y) > 1.0: 109 | print("max value is ", torch.max(y)) 110 | 111 | global mel_basis, hann_window 112 | dtype_device = str(y.dtype) + "_" + str(y.device) 113 | fmax_dtype_device = str(fmax) + "_" + dtype_device 114 | wnsize_dtype_device = str(win_size) + "_" + dtype_device 115 | if fmax_dtype_device not in mel_basis: 116 | mel = librosa_mel_fn(sr=sampling_rate, n_fft=n_fft, n_mels=num_mels, fmin=fmin, fmax=fmax) 117 | mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to( 118 | dtype=y.dtype, device=y.device 119 | ) 120 | if wnsize_dtype_device not in hann_window: 121 | hann_window[wnsize_dtype_device] = torch.hann_window(win_size).to( 122 | dtype=y.dtype, device=y.device 123 | ) 124 | 125 | y = torch.nn.functional.pad( 126 | y.unsqueeze(1), 127 | (int((n_fft - hop_size) / 2), int((n_fft - hop_size) / 2)),
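        # Illustrative padding arithmetic (assuming n_fft=1024, hop_size=256):
        # (1024 - 256) / 2 = 384 reflect-padded samples per side, so with
        # center=False a signal of T samples (T a multiple of the hop) yields
        # exactly T / 256 STFT frames, matching hop-aligned label durations.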
128 | mode="reflect", 129 | ) 130 | y = y.squeeze(1) 131 | 132 | spec = torch.stft( 133 | y, 134 | n_fft, 135 | hop_length=hop_size, 136 | win_length=win_size, 137 | window=hann_window[wnsize_dtype_device], 138 | center=center, 139 | pad_mode="reflect", 140 | normalized=False, 141 | onesided=True, 142 | return_complex=True, 143 | ) 144 | spec = torch.view_as_real(spec) 145 | spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6) 146 | 147 | spec = torch.matmul(mel_basis[fmax_dtype_device], spec) 148 | spec = spectral_normalize_torch(spec) 149 | 150 | return spec 151 | -------------------------------------------------------------------------------- /normalize_wav.py: -------------------------------------------------------------------------------- 1 | from prepare.align_wav_spec import Align 2 | import os 3 | from tqdm import tqdm 4 | 5 | align = Align(32768, 24000, 1024, 256, 1024) 6 | output_path = "singing_gt" 7 | input_path = "/home/yyu479/VISinger_data/wav_dump_24k" 8 | 9 | files = os.listdir(path=input_path) 10 | for i, wav_file in enumerate(tqdm(files)): 11 | suffix = os.path.splitext(os.path.split(wav_file)[-1])[1] 12 | if not suffix == ".wav": 13 | continue 14 | basename = os.path.splitext(os.path.split(wav_file)[-1])[0][:-7] 15 | align.normalize_wav( 16 | os.path.join(input_path, wav_file), os.path.join(output_path, wav_file) 17 | ) 18 | -------------------------------------------------------------------------------- /plot_f0.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | import librosa 5 | import librosa.display 6 | 7 | from prepare.data_vits_phn import FeatureInput, SingInput 8 | 9 | # setting 10 | hop_length = 256 11 | sample_rate = 24000 12 | wav_name = "2001000001" 13 | input_path = "singing_gt/2001000001_bits16.wav" 14 | 15 | # get mel 16 | y, sr = librosa.load(input_path, sr=sample_rate) 17 | librosa.feature.melspectrogram(y=y, sr=sr) 18 | D = librosa.stft(y, hop_length=hop_length) # STFT of y 19 | S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max) 20 | 21 | # get f0 22 | featureInput = FeatureInput("singing_gt/", sr, hop_length) 23 | featur_pit = featureInput.compute_f0("2001000001_bits16.wav") 24 | 25 | fo = open("../VISinger_data/transcriptions.txt", "r+") 26 | # load text info 27 | 28 | while True: 29 | try: 30 | message = fo.readline().strip() 31 | except Exception as e: 32 | print("nothing of except:", e) 33 | if message == None: 34 | break 35 | if message == "": 36 | break 37 | if wav_name in message: 38 | break 39 | print(message) 40 | 41 | infos = message.split("|") 42 | file = infos[0] 43 | hanz = infos[1] 44 | phon = infos[2].split(" ") 45 | note = infos[3].split(" ") 46 | note_dur = infos[4].split(" ") 47 | phon_dur = infos[5].split(" ") 48 | phon_slur = infos[6].split(" ") 49 | 50 | 51 | singInput = SingInput(sample_rate, hop_length) 52 | 53 | ( 54 | file, 55 | labels_ids, 56 | labels_dur, 57 | scores_ids, 58 | scores_dur, 59 | labels_slr, 60 | labels_uvs, 61 | ) = singInput.parseInput(message) 62 | labels_uvs = np.repeat(labels_uvs, labels_dur, axis=0) 63 | featur_pit = featur_pit[: len(labels_uvs)] 64 | featur_pit_uv = featur_pit * labels_uvs 65 | 66 | uv = featur_pit == 0 67 | featur_pit_intp = np.copy(featur_pit) 68 | featur_pit_intp[uv] = np.interp(np.where(uv)[0], np.where(~uv)[0], featur_pit[~uv]) 69 | # plot 70 | # plt.figure() 71 | fig = plt.figure(figsize=(15, 6)) 72 | 73 | librosa.display.specshow( 74 | S_db, y_axis="log", sr=sr, hop_length=hop_length, 
x_axis="frames" 75 | ) 76 | 77 | (F0_ori,) = plt.plot(featur_pit.T, "r", label="F0_ori", alpha=0.9) 78 | (F0_uv,) = plt.plot(featur_pit_uv.T, "y", label="F0_uv", alpha=0.9) 79 | (F0_intp,) = plt.plot(featur_pit_intp.T, "b", label="F0_intp", alpha=0.9) 80 | plt.legend([F0_ori, F0_uv, F0_intp], ["F0_ori", "F0_uv", "F0_intp"], loc="upper right") 81 | plt.colorbar(format="%+2.0f dB") 82 | plt.savefig("f0.png") 83 | -------------------------------------------------------------------------------- /prepare/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jerryuhoo/VISinger/ad8bc167c10275dd513ae466e73deae2f7045c99/prepare/__init__.py -------------------------------------------------------------------------------- /prepare/align_wav_spec.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | import torch.utils.data 4 | 5 | from mel_processing import spectrogram_torch 6 | from utils import load_wav_to_torch 7 | import scipy.io.wavfile as sciwav 8 | import os 9 | 10 | 11 | class Align: 12 | def __init__( 13 | self, max_wav_value, sampling_rate, filter_length, hop_length, win_length 14 | ): 15 | self.max_wav_value = max_wav_value 16 | self.sampling_rate = sampling_rate 17 | self.filter_length = filter_length 18 | self.hop_length = hop_length 19 | self.win_length = win_length 20 | 21 | def align_wav_spec(self, filename, phone_dur): 22 | phone_dur = np.int32(phone_dur) 23 | phone_dur = torch.Tensor(phone_dur).to(torch.int32) 24 | audio, sampling_rate = load_wav_to_torch(filename) 25 | if sampling_rate != self.sampling_rate: 26 | raise ValueError( 27 | "{} SR doesn't match target {} SR".format( 28 | sampling_rate, self.sampling_rate 29 | ) 30 | ) 31 | audio_norm = audio / self.max_wav_value 32 | audio_norm = audio_norm.unsqueeze(0) 33 | spec_filename = filename.replace(".wav", ".spec.pt") 34 | if os.path.exists(spec_filename): 35 | spec = torch.load(spec_filename) 36 | else: 37 | spec = spectrogram_torch( 38 | audio_norm, 39 | self.filter_length, 40 | self.sampling_rate, 41 | self.hop_length, 42 | self.win_length, 43 | center=False, 44 | ) 45 | # align mel and wave 46 | phone_dur_sum = torch.sum(phone_dur).item() 47 | spec_length = spec.shape[2] 48 | 49 | if spec_length > phone_dur_sum: 50 | spec = spec[:, :, :phone_dur_sum] 51 | elif spec_length < phone_dur_sum: 52 | pad_length = phone_dur_sum - spec_length 53 | spec = torch.nn.functional.pad( 54 | input=spec, pad=(0, pad_length, 0, 0), mode="constant", value=0 55 | ) 56 | assert spec.shape[2] == phone_dur_sum 57 | 58 | # align wav 59 | fixed_wav_len = phone_dur_sum * self.hop_length 60 | if audio_norm.shape[1] > fixed_wav_len: 61 | audio_norm = audio_norm[:, :fixed_wav_len] 62 | elif audio_norm.shape[1] < fixed_wav_len: 63 | pad_length = fixed_wav_len - audio_norm.shape[1] 64 | audio_norm = torch.nn.functional.pad( 65 | input=audio_norm, 66 | pad=(0, pad_length, 0, 0), 67 | mode="constant", 68 | value=0, 69 | ) 70 | assert audio_norm.shape[1] == fixed_wav_len 71 | 72 | # rewrite aligned wav 73 | audio = ( 74 | (audio_norm * self.max_wav_value) 75 | .transpose(0, 1) 76 | .numpy() 77 | .astype(np.int16) 78 | ) 79 | 80 | sciwav.write( 81 | filename, 82 | self.sampling_rate, 83 | audio, 84 | ) 85 | # save spec 86 | spec = torch.squeeze(spec, 0) 87 | torch.save(spec, spec_filename) 88 | return spec.shape[1] 89 | 90 | def normalize_wav(self, input_path, output_path): 91 | audio, sampling_rate = 
load_wav_to_torch(input_path) 92 | audio_norm = audio.numpy() / self.max_wav_value 93 | audio_norm *= 32767 / max(0.01, np.max(np.abs(audio_norm))) * 0.6 94 | sciwav.write( 95 | output_path, 96 | sampling_rate, 97 | audio_norm.astype(np.int16), 98 | ) 99 | -------------------------------------------------------------------------------- /prepare/data_vits.py: -------------------------------------------------------------------------------- 1 | import os 2 | import logging 3 | import numpy as np 4 | import librosa 5 | import pyworld 6 | 7 | from prepare.phone_map import label_to_ids 8 | from prepare.phone_uv import uv_map 9 | 10 | 11 | def load_midi_map(): 12 | notemap = {} 13 | notemap["rest"] = 0 14 | fo = open("./prepare/midi-note.scp", "r+") 15 | while True: 16 | try: 17 | message = fo.readline().strip() 18 | except Exception as e: 19 | print("nothing of except:", e) 20 | break 21 | if message == None: 22 | break 23 | if message == "": 24 | break 25 | infos = message.split() 26 | notemap[infos[1]] = int(infos[0]) 27 | fo.close() 28 | return notemap 29 | 30 | 31 | class SingInput(object): 32 | def __init__(self, samplerate=16000, hop_size=128): 33 | self.fs = samplerate 34 | self.hop = hop_size 35 | self.notemaper = load_midi_map() 36 | 37 | def phone_to_uv(self, phones): 38 | uv = [] 39 | for phone in phones: 40 | uv.append(uv_map[phone.lower()]) 41 | return uv 42 | 43 | def notes_to_id(self, notes): 44 | note_ids = [] 45 | for note in notes: 46 | note_ids.append(self.notemaper[note]) 47 | return note_ids 48 | 49 | def frame_duration(self, durations): 50 | ph_durs = [float(x) for x in durations] 51 | sentence_length = 0 52 | for ph_dur in ph_durs: 53 | sentence_length = sentence_length + ph_dur 54 | sentence_length = int(sentence_length * self.fs / self.hop + 0.5) 55 | 56 | sample_frame = [] 57 | startTime = 0 58 | for i_ph in range(len(ph_durs)): 59 | start_frame = int(startTime * self.fs / self.hop + 0.5) 60 | end_frame = int((startTime + ph_durs[i_ph]) * self.fs / self.hop + 0.5) 61 | count_frame = end_frame - start_frame 62 | sample_frame.append(count_frame) 63 | startTime = startTime + ph_durs[i_ph] 64 | all_frame = np.sum(sample_frame) 65 | assert all_frame == sentence_length 66 | # match mel length 67 | sample_frame[-1] = sample_frame[-1] - 1 68 | return sample_frame 69 | 70 | def score_duration(self, durations): 71 | ph_durs = [float(x) for x in durations] 72 | sample_frame = [] 73 | for i_ph in range(len(ph_durs)): 74 | count_frame = int(ph_durs[i_ph] * self.fs / self.hop + 0.5) 75 | if count_frame >= 256: 76 | print("count_frame", count_frame) 77 | count_frame = 255 78 | sample_frame.append(count_frame) 79 | return sample_frame 80 | 81 | def parseInput(self, singinfo: str): 82 | infos = singinfo.split("|") 83 | file = infos[0] 84 | # hanz = infos[1] 85 | phon = infos[2].split(" ") 86 | note = infos[3].split(" ") 87 | note_dur = infos[4].split(" ") 88 | phon_dur = infos[5].split(" ") 89 | phon_slr = infos[6].split(" ") 90 | 91 | labels_ids = label_to_ids(phon) 92 | labels_uvs = self.phone_to_uv(phon) 93 | labels_frames = self.frame_duration(phon_dur) 94 | scores_ids = self.notes_to_id(note) 95 | scores_dur = self.score_duration(note_dur) 96 | labels_slr = [int(x) for x in phon_slr] 97 | return ( 98 | file, 99 | labels_ids, 100 | labels_frames, 101 | scores_ids, 102 | scores_dur, 103 | labels_slr, 104 | labels_uvs, 105 | ) 106 | 107 | def parseSong(self, singinfo: str): 108 | infos = singinfo.split("|") 109 | item_indx = infos[0] 110 | item_time = infos[1] 111 | # hanz = infos[2] 
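        # Illustrative line layout consumed here (hypothetical values; the real
        # transcription files define the actual fields):
        #   0001|0.0_12.8|<text>|sh i ...|60 62 ...|0.50 ...|0.30 ...|0 0 ...
        # i.e. index|time|text|phones|note ids|note durs|phone durs|slur flags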
112 | phon = infos[3].split(" ") 113 | note_ids = infos[4].split(" ") 114 | note_dur = infos[5].split(" ") 115 | phon_dur = infos[6].split(" ") 116 | phon_slr = infos[7].split(" ") 117 | 118 | labels_ids = label_to_ids(phon) 119 | labels_uvs = self.phone_to_uv(phon) 120 | labels_frames = self.frame_duration(phon_dur) 121 | scores_ids = [int(x) if x != "rest" else 0 for x in note_ids] 122 | scores_dur = self.score_duration(note_dur) 123 | labels_slr = [int(x) for x in phon_slr] 124 | return ( 125 | item_indx, 126 | item_time, 127 | labels_ids, 128 | labels_frames, 129 | scores_ids, 130 | scores_dur, 131 | labels_slr, 132 | labels_uvs, 133 | ) 134 | 135 | def expandInput(self, labels_ids, labels_frames): 136 | assert len(labels_ids) == len(labels_frames) 137 | frame_num = np.sum(labels_frames) 138 | frame_labels = np.zeros(frame_num, dtype=np.int64) 139 | start = 0 140 | for index, num in enumerate(labels_frames): 141 | frame_labels[start : start + num] = labels_ids[index] 142 | start += num 143 | return frame_labels 144 | 145 | def scorePitch(self, scores_id): 146 | score_pitch = np.zeros(len(scores_id), dtype=np.float64) 147 | for index, score_id in enumerate(scores_id): 148 | if score_id == 0: 149 | score_pitch[index] = 0 150 | else: 151 | pitch = librosa.midi_to_hz(score_id) 152 | score_pitch[index] = round(pitch, 1) 153 | return score_pitch 154 | 155 | def smoothPitch(self, pitch): 156 | # smooth the pitch contour by convolution 157 | kernel = np.hanning(5) # a symmetric Hanning smoothing kernel 158 | kernel /= kernel.sum() 159 | smooth_pitch = np.convolve(pitch, kernel, "same") 160 | return smooth_pitch 161 | 162 | 163 | class FeatureInput(object): 164 | def __init__(self, path, samplerate=16000, hop_size=128): 165 | self.fs = samplerate 166 | self.hop = hop_size 167 | self.path = path 168 | 169 | self.f0_bin = 256 170 | self.f0_max = 1100.0 171 | self.f0_min = 50.0 172 | self.f0_mel_min = 1127 * np.log(1 + self.f0_min / 700) 173 | self.f0_mel_max = 1127 * np.log(1 + self.f0_max / 700) 174 | 175 | def compute_f0(self, filename): 176 | x, sr = librosa.load(self.path + filename, sr=self.fs) 177 | assert sr == self.fs 178 | f0, t = pyworld.dio( 179 | x.astype(np.double), 180 | fs=sr, 181 | f0_ceil=800, 182 | frame_period=1000 * self.hop / sr, 183 | ) 184 | f0 = pyworld.stonemask(x.astype(np.double), f0, t, self.fs) 185 | for index, pitch in enumerate(f0): 186 | f0[index] = round(pitch, 1) 187 | return f0 188 | 189 | def coarse_f0(self, f0): 190 | f0_mel = 1127 * np.log(1 + f0 / 700) 191 | f0_mel[f0_mel > 0] = (f0_mel[f0_mel > 0] - self.f0_mel_min) * ( 192 | self.f0_bin - 2 193 | ) / (self.f0_mel_max - self.f0_mel_min) + 1 194 | 195 | # clamp to [1, f0_bin - 1]; unvoiced frames map to 1 196 | f0_mel[f0_mel <= 1] = 1 197 | f0_mel[f0_mel > self.f0_bin - 1] = self.f0_bin - 1 198 | f0_coarse = np.rint(f0_mel).astype(np.int64) 199 | assert f0_coarse.max() <= 255 and f0_coarse.min() >= 1, ( 200 | f0_coarse.max(), 201 | f0_coarse.min(), 202 | ) 203 | return f0_coarse 204 | 205 | def diff_f0(self, scores_pit, featur_pit, labels_frames): 206 | length_pit = min(len(scores_pit), len(featur_pit)) 207 | offset_pit = np.zeros(length_pit, dtype=np.int64) 208 | for idx in range(length_pit): 209 | s_pit = scores_pit[idx] 210 | f_pit = featur_pit[idx] 211 | if s_pit == 0 or f_pit == 0: 212 | offset_pit[idx] = 0 213 | else: 214 | tmp = int(f_pit - s_pit) 215 | tmp = +128 if tmp > +128 else tmp 216 | tmp = -127 if tmp < -127 else tmp 217 | tmp = 256 + tmp if tmp < 0 else tmp 218 | offset_pit[idx] = tmp 219 | offset_pit[offset_pit > 255] = 255 220 | offset_pit[offset_pit < 0] = 0 221 | # start = 0 222 | # for
num in labels_frames: 223 | # print("---------------------------------------------") 224 | # print(scores_pit[start:start+num]) 225 | # print(featur_pit[start:start+num]) 226 | # print(offset_pit[start:start+num]) 227 | # start += num 228 | return offset_pit 229 | 230 | 231 | if __name__ == "__main__": 232 | logging.basicConfig(level=logging.INFO) # ERROR & INFO 233 | 234 | notemaper = load_midi_map() 235 | logging.info(notemaper) 236 | 237 | singInput = SingInput(16000, 256) 238 | featureInput = FeatureInput("../VISinger_data/wav_dump_16k/", 16000, 256) 239 | 240 | if not os.path.exists("../VISinger_data/label_vits"): 241 | os.mkdir("../VISinger_data/label_vits") 242 | 243 | fo = open("../VISinger_data/transcriptions.txt", "r+") 244 | vits_file = open("./filelists/vits_file.txt", "w", encoding="utf-8") 245 | i = 0 246 | all_txt = [] # for counting distinct sentences 247 | while True: 248 | try: 249 | message = fo.readline().strip() 250 | except Exception as e: 251 | print("nothing of except:", e) 252 | break 253 | if message is None: 254 | break 255 | if message == "": 256 | break 257 | i = i + 1 258 | # if i > 5: 259 | # exit() 260 | infos = message.split("|") 261 | file = infos[0] 262 | hanz = infos[1] 263 | all_txt.append(hanz) 264 | phon = infos[2].split(" ") 265 | note = infos[3].split(" ") 266 | note_dur = infos[4].split(" ") 267 | phon_dur = infos[5].split(" ") 268 | phon_slur = infos[6].split(" ") 269 | 270 | logging.info("----------------------------") 271 | logging.info(file) 272 | logging.info(hanz) 273 | logging.info(phon) 274 | # logging.info(note_dur) 275 | # logging.info(phon_dur) 276 | # logging.info(phon_slur) 277 | 278 | ( 279 | file, 280 | labels_ids, 281 | labels_frames, 282 | scores_ids, 283 | scores_dur, 284 | labels_slr, 285 | labels_uvs, 286 | ) = singInput.parseInput(message) 287 | labels_ids = singInput.expandInput(labels_ids, labels_frames) 288 | labels_uvs = singInput.expandInput(labels_uvs, labels_frames) 289 | labels_slr = singInput.expandInput(labels_slr, labels_frames) 290 | scores_ids = singInput.expandInput(scores_ids, labels_frames) 291 | scores_pit = singInput.scorePitch(scores_ids) 292 | featur_pit = featureInput.compute_f0(f"{file}_bits16.wav") 293 | featur_pit = featur_pit[: len(labels_ids)] 294 | featur_pit = featur_pit * labels_uvs 295 | coarse_pit = featureInput.coarse_f0(featur_pit) 296 | 297 | # offset_pit = featureInput.diff_f0(scores_pit, featur_pit, labels_frames) 298 | assert len(labels_ids) == len(coarse_pit) 299 | 300 | logging.info(labels_ids) 301 | logging.info(scores_ids) 302 | logging.info(coarse_pit) 303 | logging.info(labels_slr) 304 | 305 | np.save( 306 | f"../VISinger_data/label_vits/{file}_label.npy", 307 | labels_ids, 308 | allow_pickle=False, 309 | ) 310 | np.save( 311 | f"../VISinger_data/label_vits/{file}_score.npy", 312 | scores_ids, 313 | allow_pickle=False, 314 | ) 315 | np.save( 316 | f"../VISinger_data/label_vits/{file}_pitch.npy", 317 | coarse_pit, 318 | allow_pickle=False, 319 | ) 320 | np.save( 321 | f"../VISinger_data/label_vits/{file}_slurs.npy", 322 | labels_slr, 323 | allow_pickle=False, 324 | ) 325 | 326 | # line format: wave path|label path|score path|pitch path|slurs path; paths start with one "." when run from the repo root and two ".." when called from a subdirectory 327 | path_wave = f"../VISinger_data/wav_dump_16k/{file}_bits16.wav" 328 | path_label = f"../VISinger_data/label_vits/{file}_label.npy" 329 | path_score = f"../VISinger_data/label_vits/{file}_score.npy" 330 | path_pitch = f"../VISinger_data/label_vits/{file}_pitch.npy" 331 | path_slurs = f"../VISinger_data/label_vits/{file}_slurs.npy" 332 |
print( 333 | f"{path_wave}|{path_label}|{path_score}|{path_pitch}|{path_slurs}", 334 | file=vits_file, 335 | ) 336 | 337 | fo.close() 338 | vits_file.close() 339 | print(len(set(all_txt))) # 统计非重复的句子个数 340 | -------------------------------------------------------------------------------- /prepare/data_vits_phn.py: -------------------------------------------------------------------------------- 1 | import os 2 | import logging 3 | import numpy as np 4 | import librosa 5 | import pyworld 6 | 7 | from prepare.phone_map import label_to_ids 8 | from prepare.phone_uv import uv_map 9 | from prepare.dur_to_frame import dur_to_frame 10 | from prepare.align_wav_spec import Align 11 | 12 | 13 | def load_midi_map(): 14 | notemap = {} 15 | notemap["rest"] = 0 16 | fo = open("./prepare/midi-note.scp", "r+") 17 | while True: 18 | try: 19 | message = fo.readline().strip() 20 | except Exception as e: 21 | print("nothing of except:", e) 22 | break 23 | if message == None: 24 | break 25 | if message == "": 26 | break 27 | infos = message.split() 28 | notemap[infos[1]] = int(infos[0]) 29 | fo.close() 30 | return notemap 31 | 32 | 33 | class SingInput(object): 34 | def __init__(self, sample_rate=24000, hop_size=256): 35 | self.fs = sample_rate 36 | self.hop = hop_size 37 | self.notemaper = load_midi_map() 38 | self.align = Align(32768, sample_rate, 1024, hop_size, 1024) 39 | 40 | def phone_to_uv(self, phones): 41 | uv = [] 42 | for phone in phones: 43 | uv.append(uv_map[phone.lower()]) 44 | return uv 45 | 46 | def notes_to_id(self, notes): 47 | note_ids = [] 48 | for note in notes: 49 | note_ids.append(self.notemaper[note]) 50 | return note_ids 51 | 52 | def frame_duration(self, durations): 53 | ph_durs = [float(x) for x in durations] 54 | sentence_length = 0 55 | for ph_dur in ph_durs: 56 | sentence_length = sentence_length + ph_dur 57 | sentence_length = int(sentence_length * self.fs / self.hop + 0.5) 58 | 59 | sample_frame = [] 60 | startTime = 0 61 | for i_ph in range(len(ph_durs)): 62 | start_frame = int(startTime * self.fs / self.hop + 0.5) 63 | end_frame = int((startTime + ph_durs[i_ph]) * self.fs / self.hop + 0.5) 64 | count_frame = end_frame - start_frame 65 | sample_frame.append(count_frame) 66 | startTime = startTime + ph_durs[i_ph] 67 | all_frame = np.sum(sample_frame) 68 | assert all_frame == sentence_length 69 | # match mel length 70 | sample_frame[-1] = sample_frame[-1] - 1 71 | return sample_frame 72 | 73 | def score_duration(self, durations): 74 | ph_durs = [float(x) for x in durations] 75 | sample_frame = [] 76 | for i_ph in range(len(ph_durs)): 77 | count_frame = int(ph_durs[i_ph] * self.fs / self.hop + 0.5) 78 | if count_frame >= 256: 79 | print("count_frame", count_frame) 80 | count_frame = 255 81 | sample_frame.append(count_frame) 82 | return sample_frame 83 | 84 | def parseInput(self, singinfo: str): 85 | infos = singinfo.split("|") 86 | file = infos[0] 87 | # hanz = infos[1] 88 | phon = infos[2].split(" ") 89 | note = infos[3].split(" ") 90 | note_dur = infos[4].split(" ") 91 | phon_dur = infos[5].split(" ") 92 | phon_slr = infos[6].split(" ") 93 | 94 | labels_ids = label_to_ids(phon) 95 | labels_uvs = self.phone_to_uv(phon) 96 | note_ids = self.notes_to_id(note) 97 | # convert into float 98 | note_dur = [eval(i) for i in note_dur] 99 | phon_dur = [eval(i) for i in phon_dur] 100 | 101 | note_dur = dur_to_frame(note_dur, self.fs, self.hop) 102 | phon_dur = dur_to_frame(phon_dur, self.fs, self.hop) 103 | labels_slr = [int(x) for x in phon_slr] 104 | 105 | # print("labels_ids", labels_ids) 
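        # Worked example of the dur_to_frame() conversion used above (a
        # sketch, assuming the defaults fs=24000 and hop=256): a 0.5 s
        # phoneme maps to
        #     int(0.5 * 24000 / 256 + 0.5) = int(47.375) = 47 frames,
        # i.e. each duration is rounded to the nearest whole hop, while
        # frame_duration() instead rounds segment *boundaries* so that the
        # per-phoneme counts always sum to the sentence length.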
106 | # print("note_dur", note_dur) 107 | # print("phon_dur", phon_dur) 108 | # print("labels_slr", labels_slr) 109 | return ( 110 | file, 111 | labels_ids, 112 | phon_dur, 113 | note_ids, 114 | note_dur, 115 | labels_slr, 116 | labels_uvs, 117 | ) 118 | 119 | def parseSong(self, singinfo: str): 120 | infos = singinfo.split("|") 121 | item_indx = infos[0] 122 | item_time = infos[1] 123 | # hanz = infos[2] 124 | phon = infos[3].split(" ") 125 | note_ids = infos[4].split(" ") 126 | note_dur = infos[5].split(" ") 127 | phon_dur = infos[6].split(" ") 128 | phon_slr = infos[7].split(" ") 129 | 130 | labels_ids = label_to_ids(phon) 131 | labels_uvs = self.phone_to_uv(phon) 132 | labels_frames = self.frame_duration(phon_dur) 133 | scores_ids = [int(x) if x != "rest" else 0 for x in note_ids] 134 | scores_dur = self.score_duration(note_dur) 135 | labels_slr = [int(x) for x in phon_slr] 136 | return ( 137 | item_indx, 138 | item_time, 139 | labels_ids, 140 | labels_frames, 141 | scores_ids, 142 | scores_dur, 143 | labels_slr, 144 | labels_uvs, 145 | ) 146 | 147 | def expandInput(self, labels_ids, labels_frames): 148 | assert len(labels_ids) == len(labels_frames) 149 | frame_num = np.sum(labels_frames) 150 | frame_labels = np.zeros(frame_num, dtype=np.int) 151 | start = 0 152 | for index, num in enumerate(labels_frames): 153 | frame_labels[start : start + num] = labels_ids[index] 154 | start += num 155 | return frame_labels 156 | 157 | def scorePitch(self, scores_id): 158 | score_pitch = np.zeros(len(scores_id), dtype=np.float) 159 | for index, score_id in enumerate(scores_id): 160 | if score_id == 0: 161 | score_pitch[index] = 0 162 | else: 163 | pitch = librosa.midi_to_hz(score_id) 164 | score_pitch[index] = round(pitch, 1) 165 | return score_pitch 166 | 167 | def smoothPitch(self, pitch): 168 | # 使用卷积对数据平滑 169 | kernel = np.hanning(5) # 随机生成一个卷积核(对称的) 170 | kernel /= kernel.sum() 171 | smooth_pitch = np.convolve(pitch, kernel, "same") 172 | return smooth_pitch 173 | 174 | def align_process(self, file, phn_dur): 175 | return self.align.align_wav_spec(file, phn_dur) 176 | 177 | 178 | class FeatureInput(object): 179 | def __init__(self, path, samplerate=24000, hop_size=256): 180 | self.fs = samplerate 181 | self.hop = hop_size 182 | self.path = path 183 | 184 | self.f0_bin = 256 185 | self.f0_max = 1100.0 186 | self.f0_min = 50.0 187 | self.f0_mel_min = 1127 * np.log(1 + self.f0_min / 700) 188 | self.f0_mel_max = 1127 * np.log(1 + self.f0_max / 700) 189 | 190 | def compute_f0(self, filename): 191 | x, sr = librosa.load(self.path + filename, self.fs) 192 | assert sr == self.fs 193 | f0, t = pyworld.dio( 194 | x.astype(np.double), 195 | fs=sr, 196 | f0_ceil=800, 197 | frame_period=1000 * self.hop / sr, 198 | ) 199 | f0 = pyworld.stonemask(x.astype(np.double), f0, t, self.fs) 200 | for index, pitch in enumerate(f0): 201 | f0[index] = round(pitch, 1) 202 | return f0 203 | 204 | def coarse_f0(self, f0): 205 | f0_mel = 1127 * np.log(1 + f0 / 700) 206 | f0_mel[f0_mel > 0] = (f0_mel[f0_mel > 0] - self.f0_mel_min) * ( 207 | self.f0_bin - 2 208 | ) / (self.f0_mel_max - self.f0_mel_min) + 1 209 | 210 | # use 0 or 1 211 | f0_mel[f0_mel <= 1] = 1 212 | f0_mel[f0_mel > self.f0_bin - 1] = self.f0_bin - 1 213 | f0_coarse = np.rint(f0_mel).astype(np.int) 214 | assert f0_coarse.max() <= 255 and f0_coarse.min() >= 1, ( 215 | f0_coarse.max(), 216 | f0_coarse.min(), 217 | ) 218 | return f0_coarse 219 | 220 | def diff_f0(self, scores_pit, featur_pit, labels_frames): 221 | length_pit = min(len(scores_pit), 
len(featur_pit)) 222 | offset_pit = np.zeros(length_pit, dtype=np.int) 223 | for idx in range(length_pit): 224 | s_pit = scores_pit[idx] 225 | f_pit = featur_pit[idx] 226 | if s_pit == 0 or f_pit == 0: 227 | offset_pit[idx] = 0 228 | else: 229 | tmp = int(f_pit - s_pit) 230 | tmp = +128 if tmp > +128 else tmp 231 | tmp = -127 if tmp < -127 else tmp 232 | tmp = 256 + tmp if tmp < 0 else tmp 233 | offset_pit[idx] = tmp 234 | offset_pit[offset_pit > 255] = 255 235 | offset_pit[offset_pit < 0] = 0 236 | # start = 0 237 | # for num in labels_frames: 238 | # print("---------------------------------------------") 239 | # print(scores_pit[start:start+num]) 240 | # print(featur_pit[start:start+num]) 241 | # print(offset_pit[start:start+num]) 242 | # start += num 243 | return offset_pit 244 | 245 | 246 | if __name__ == "__main__": 247 | output_path = "../VISinger_data/label_vits_phn/" 248 | wav_path = "../VISinger_data/wav_dump_24k/" 249 | logging.basicConfig(level=logging.INFO) # ERROR & INFO 250 | pitch_norm = True 251 | pitch_intp = True 252 | uv_process = False 253 | 254 | notemaper = load_midi_map() 255 | logging.info(notemaper) 256 | 257 | sample_rate = 24000 258 | hop_size = 256 259 | singInput = SingInput(sample_rate, hop_size) 260 | featureInput = FeatureInput(wav_path, sample_rate, hop_size) 261 | 262 | if not os.path.exists(output_path): 263 | os.mkdir(output_path) 264 | 265 | fo = open("../VISinger_data/transcriptions.txt", "r+") 266 | # vits_file = open("./filelists/vits_file_phn.txt", "w", encoding="utf-8") 267 | vits_file = open("./filelists/vits_file.txt", "w", encoding="utf-8") 268 | i = 0 269 | all_txt = [] # 统计非重复的句子个数 270 | while True: 271 | try: 272 | message = fo.readline().strip() 273 | except Exception as e: 274 | print("nothing of except:", e) 275 | break 276 | if message == None: 277 | break 278 | if message == "": 279 | break 280 | i = i + 1 281 | # if i > 5: 282 | # exit() 283 | infos = message.split("|") 284 | file = infos[0] 285 | hanz = infos[1] 286 | all_txt.append(hanz) 287 | phon = infos[2].split(" ") 288 | note = infos[3].split(" ") 289 | note_dur = infos[4].split(" ") 290 | phon_dur = infos[5].split(" ") 291 | phon_slur = infos[6].split(" ") 292 | 293 | logging.info("----------------------------") 294 | logging.info("file {}".format(file)) 295 | logging.info("lyrics {}".format(hanz)) 296 | logging.info("phn {}".format(phon)) 297 | # logging.info(note_dur) 298 | # logging.info(phon_dur) 299 | # logging.info(phon_slur) 300 | 301 | ( 302 | file, 303 | labels_ids, 304 | labels_dur, 305 | scores_ids, 306 | scores_dur, 307 | labels_slr, 308 | labels_uvs, 309 | ) = singInput.parseInput(message) 310 | # labels_ids = singInput.expandInput(labels_ids, labels_frames) 311 | # labels_uvs = singInput.expandInput(labels_uvs, labels_frames) 312 | # labels_slr = singInput.expandInput(labels_slr, labels_frames) 313 | # scores_ids = singInput.expandInput(scores_ids, labels_frames) 314 | # scores_pit = singInput.scorePitch(scores_ids) 315 | featur_pit = featureInput.compute_f0(f"{file}_bits16.wav") 316 | wav_file = os.path.join(wav_path, file + "_bits16.wav") 317 | 318 | spec_len = singInput.align_process(wav_file, labels_dur) 319 | 320 | # extend uv 321 | labels_uvs = np.repeat(labels_uvs, labels_dur, axis=0) 322 | 323 | featur_pit = featur_pit[:spec_len] 324 | 325 | if featur_pit.shape[0] < spec_len: 326 | pad_length = spec_len - featur_pit.shape[0] 327 | featur_pit = np.pad(featur_pit, pad_width=(0, pad_length), mode="constant") 328 | assert featur_pit.shape[0] == spec_len 329 | 
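        # Quick sanity check for the pitch handling below (a sketch): with
        # the mel-style normalization 2595 * log10(1 + f0 / 700) / 500,
        # f0 = 440 Hz (A4) maps to about 1.10 and the 1100 Hz f0_max to
        # about 2.13, so normalized pitch stays roughly in [0, 2.2];
        # unvoiced frames remain exactly 0 until the interpolation branch
        # (pitch_intp) fills them in from their voiced neighbors.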
if uv_process: 330 | featur_pit = featur_pit * labels_uvs 331 | coarse_pit = featureInput.coarse_f0(featur_pit) 332 | 333 | # log f0 334 | if not pitch_norm: 335 | nonzero_idxs = np.where(featur_pit != 0)[0] 336 | featur_pit[nonzero_idxs] = np.log(featur_pit[nonzero_idxs]) 337 | else: 338 | featur_pit = 2595.0 * np.log10(1.0 + featur_pit / 700.0) / 500 339 | 340 | if pitch_intp: 341 | uv = featur_pit == 0 342 | featur_pit_intp = np.copy(featur_pit) 343 | featur_pit_intp[uv] = np.interp( 344 | np.where(uv)[0], np.where(~uv)[0], featur_pit[~uv] 345 | ) 346 | 347 | # offset_pit = featureInput.diff_f0(scores_pit, featur_pit, labels_frames) 348 | # assert len(labels_ids) == len(coarse_pit) 349 | assert len(labels_ids) == len(labels_dur) 350 | assert len(labels_dur) == len(scores_ids) 351 | assert len(scores_ids) == len(scores_dur) 352 | assert len(scores_dur) == len(labels_slr) 353 | 354 | logging.info("labels_ids {}".format(labels_ids)) 355 | # logging.info("labels_dur {}".format(labels_dur)) 356 | # logging.info("scores_ids {}".format(scores_ids)) 357 | # logging.info("scores_dur {}".format(scores_dur)) 358 | # logging.info("labels_slr {}".format(labels_slr)) 359 | # logging.info("labels_uvs {}".format(labels_uvs)) 360 | # logging.info("featur_pit {}".format(featur_pit)) 361 | logging.info("featur_pit_intp {}".format(featur_pit_intp)) 362 | 363 | np.save( 364 | output_path + f"{file}_label.npy", 365 | labels_ids, 366 | allow_pickle=False, 367 | ) 368 | np.save( 369 | output_path + f"{file}_label_dur.npy", 370 | labels_dur, 371 | allow_pickle=False, 372 | ) 373 | np.save( 374 | output_path + f"{file}_score.npy", 375 | scores_ids, 376 | allow_pickle=False, 377 | ) 378 | np.save( 379 | output_path + f"{file}_score_dur.npy", 380 | scores_dur, 381 | allow_pickle=False, 382 | ) 383 | if not pitch_intp: 384 | np.save( 385 | output_path + f"{file}_pitch.npy", 386 | featur_pit, 387 | allow_pickle=False, 388 | ) 389 | else: 390 | np.save( 391 | output_path + f"{file}_pitch.npy", 392 | featur_pit_intp, 393 | allow_pickle=False, 394 | ) 395 | # np.save( 396 | # output_path + f"{file}_pitch.npy", 397 | # coarse_pit, 398 | # allow_pickle=False, 399 | # ) 400 | np.save( 401 | output_path + f"{file}_slurs.npy", 402 | labels_slr, 403 | allow_pickle=False, 404 | ) 405 | 406 | # wave path|label path|label frame|score path|score duration;上面是一个.(当前目录),下面是两个..(从子目录调用) 407 | path_wave = wav_path + f"{file}_bits16.wav" 408 | path_label = output_path + f"{file}_label.npy" 409 | path_label_dur = output_path + f"{file}_label_dur.npy" 410 | path_score = output_path + f"{file}_score.npy" 411 | path_score_dur = output_path + f"{file}_score_dur.npy" 412 | path_pitch = output_path + f"{file}_pitch.npy" 413 | path_slurs = output_path + f"{file}_slurs.npy" 414 | print( 415 | f"{path_wave}|{path_label}|{path_label_dur}|{path_score}|{path_score_dur}|{path_pitch}|{path_slurs}", 416 | file=vits_file, 417 | ) 418 | 419 | fo.close() 420 | vits_file.close() 421 | print(len(set(all_txt))) # 统计非重复的句子个数 422 | -------------------------------------------------------------------------------- /prepare/data_vits_phn_ofuton.py: -------------------------------------------------------------------------------- 1 | import os 2 | import logging 3 | import numpy as np 4 | import librosa 5 | import pyworld 6 | 7 | from prepare.phone_map import label_to_ids 8 | from prepare.phone_uv import uv_map 9 | from prepare.dur_to_frame import dur_to_frame 10 | from prepare.align_wav_spec import Align 11 | 12 | 13 | def load_midi_map(): 14 | notemap = {} 15 | 
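    # midi-note.scp holds one "number name" pair per line (the file appears
    # later in this repo), e.g. "69 A4" and "60 C4", so notemap ends up as
    # {"rest": 0, ..., "A4": 69, "C4": 60, ...}.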
notemap["rest"] = 0 16 | fo = open("./prepare/midi-note.scp", "r+") 17 | while True: 18 | try: 19 | message = fo.readline().strip() 20 | except Exception as e: 21 | print("nothing of except:", e) 22 | break 23 | if message == None: 24 | break 25 | if message == "": 26 | break 27 | infos = message.split() 28 | notemap[infos[1]] = int(infos[0]) 29 | fo.close() 30 | return notemap 31 | 32 | 33 | class SingInput(object): 34 | def __init__(self, sample_rate=24000, hop_size=256): 35 | self.fs = sample_rate 36 | self.hop = hop_size 37 | self.notemaper = load_midi_map() 38 | self.align = Align(32768, sample_rate, 1024, hop_size, 1024) 39 | 40 | def phone_to_uv(self, phones): 41 | uv = [] 42 | for phone in phones: 43 | uv.append(uv_map[phone.lower()]) 44 | return uv 45 | 46 | def notes_to_id(self, notes): 47 | note_ids = [] 48 | for note in notes: 49 | note_ids.append(self.notemaper[note]) 50 | return note_ids 51 | 52 | def frame_duration(self, durations): 53 | ph_durs = [float(x) for x in durations] 54 | sentence_length = 0 55 | for ph_dur in ph_durs: 56 | sentence_length = sentence_length + ph_dur 57 | sentence_length = int(sentence_length * self.fs / self.hop + 0.5) 58 | 59 | sample_frame = [] 60 | startTime = 0 61 | for i_ph in range(len(ph_durs)): 62 | start_frame = int(startTime * self.fs / self.hop + 0.5) 63 | end_frame = int((startTime + ph_durs[i_ph]) * self.fs / self.hop + 0.5) 64 | count_frame = end_frame - start_frame 65 | sample_frame.append(count_frame) 66 | startTime = startTime + ph_durs[i_ph] 67 | all_frame = np.sum(sample_frame) 68 | assert all_frame == sentence_length 69 | # match mel length 70 | sample_frame[-1] = sample_frame[-1] - 1 71 | return sample_frame 72 | 73 | def score_duration(self, durations): 74 | ph_durs = [float(x) for x in durations] 75 | sample_frame = [] 76 | for i_ph in range(len(ph_durs)): 77 | count_frame = int(ph_durs[i_ph] * self.fs / self.hop + 0.5) 78 | if count_frame >= 256: 79 | print("count_frame", count_frame) 80 | count_frame = 255 81 | sample_frame.append(count_frame) 82 | return sample_frame 83 | 84 | def parseInput(self, singinfo: str): 85 | infos = singinfo.split("|") 86 | file = infos[0] 87 | # hanz = infos[1] 88 | phon = infos[2].split(" ") 89 | note = infos[3].split(" ") 90 | note_dur = infos[4].split(" ") 91 | phon_dur = infos[5].split(" ") 92 | phon_slr = infos[6].split(" ") 93 | 94 | labels_ids = label_to_ids(phon) 95 | # labels_uvs = self.phone_to_uv(phon) 96 | note_ids = self.notes_to_id(note) 97 | # convert into float 98 | note_dur = [eval(i) for i in note_dur] 99 | phon_dur = [eval(i) for i in phon_dur] 100 | 101 | note_dur = dur_to_frame(note_dur, self.fs, self.hop) 102 | phon_dur = dur_to_frame(phon_dur, self.fs, self.hop) 103 | labels_slr = [int(x) for x in phon_slr] 104 | 105 | # print("labels_ids", labels_ids) 106 | # print("note_dur", note_dur) 107 | # print("phon_dur", phon_dur) 108 | # print("labels_slr", labels_slr) 109 | return ( 110 | file, 111 | labels_ids, 112 | phon_dur, 113 | note_ids, 114 | note_dur, 115 | labels_slr, 116 | # labels_uvs, 117 | ) 118 | 119 | def parseSong(self, singinfo: str): 120 | infos = singinfo.split("|") 121 | item_indx = infos[0] 122 | item_time = infos[1] 123 | # hanz = infos[2] 124 | phon = infos[3].split(" ") 125 | note_ids = infos[4].split(" ") 126 | note_dur = infos[5].split(" ") 127 | phon_dur = infos[6].split(" ") 128 | phon_slr = infos[7].split(" ") 129 | 130 | labels_ids = label_to_ids(phon) 131 | # labels_uvs = self.phone_to_uv(phon) 132 | labels_frames = self.frame_duration(phon_dur) 
133 | scores_ids = [int(x) if x != "rest" else 0 for x in note_ids] 134 | scores_dur = self.score_duration(note_dur) 135 | labels_slr = [int(x) for x in phon_slr] 136 | return ( 137 | item_indx, 138 | item_time, 139 | labels_ids, 140 | labels_frames, 141 | scores_ids, 142 | scores_dur, 143 | labels_slr, 144 | # labels_uvs, 145 | ) 146 | 147 | def expandInput(self, labels_ids, labels_frames): 148 | assert len(labels_ids) == len(labels_frames) 149 | frame_num = np.sum(labels_frames) 150 | frame_labels = np.zeros(frame_num, dtype=np.int) 151 | start = 0 152 | for index, num in enumerate(labels_frames): 153 | frame_labels[start : start + num] = labels_ids[index] 154 | start += num 155 | return frame_labels 156 | 157 | def scorePitch(self, scores_id): 158 | score_pitch = np.zeros(len(scores_id), dtype=np.float) 159 | for index, score_id in enumerate(scores_id): 160 | if score_id == 0: 161 | score_pitch[index] = 0 162 | else: 163 | pitch = librosa.midi_to_hz(score_id) 164 | score_pitch[index] = round(pitch, 1) 165 | return score_pitch 166 | 167 | def smoothPitch(self, pitch): 168 | # 使用卷积对数据平滑 169 | kernel = np.hanning(5) # 随机生成一个卷积核(对称的) 170 | kernel /= kernel.sum() 171 | smooth_pitch = np.convolve(pitch, kernel, "same") 172 | return smooth_pitch 173 | 174 | def align_process(self, file, phn_dur): 175 | return self.align.align_wav_spec(file, phn_dur) 176 | 177 | 178 | class FeatureInput(object): 179 | def __init__(self, path, samplerate=24000, hop_size=256): 180 | self.fs = samplerate 181 | self.hop = hop_size 182 | self.path = path 183 | 184 | self.f0_bin = 256 185 | self.f0_max = 1100.0 186 | self.f0_min = 50.0 187 | self.f0_mel_min = 1127 * np.log(1 + self.f0_min / 700) 188 | self.f0_mel_max = 1127 * np.log(1 + self.f0_max / 700) 189 | 190 | def compute_f0(self, filename): 191 | x, sr = librosa.load(self.path + filename, self.fs) 192 | assert sr == self.fs 193 | f0, t = pyworld.dio( 194 | x.astype(np.double), 195 | fs=sr, 196 | f0_ceil=800, 197 | frame_period=1000 * self.hop / sr, 198 | ) 199 | f0 = pyworld.stonemask(x.astype(np.double), f0, t, self.fs) 200 | for index, pitch in enumerate(f0): 201 | f0[index] = round(pitch, 1) 202 | return f0 203 | 204 | def coarse_f0(self, f0): 205 | f0_mel = 1127 * np.log(1 + f0 / 700) 206 | f0_mel[f0_mel > 0] = (f0_mel[f0_mel > 0] - self.f0_mel_min) * ( 207 | self.f0_bin - 2 208 | ) / (self.f0_mel_max - self.f0_mel_min) + 1 209 | 210 | # use 0 or 1 211 | f0_mel[f0_mel <= 1] = 1 212 | f0_mel[f0_mel > self.f0_bin - 1] = self.f0_bin - 1 213 | f0_coarse = np.rint(f0_mel).astype(np.int) 214 | assert f0_coarse.max() <= 255 and f0_coarse.min() >= 1, ( 215 | f0_coarse.max(), 216 | f0_coarse.min(), 217 | ) 218 | return f0_coarse 219 | 220 | def diff_f0(self, scores_pit, featur_pit, labels_frames): 221 | length_pit = min(len(scores_pit), len(featur_pit)) 222 | offset_pit = np.zeros(length_pit, dtype=np.int) 223 | for idx in range(length_pit): 224 | s_pit = scores_pit[idx] 225 | f_pit = featur_pit[idx] 226 | if s_pit == 0 or f_pit == 0: 227 | offset_pit[idx] = 0 228 | else: 229 | tmp = int(f_pit - s_pit) 230 | tmp = +128 if tmp > +128 else tmp 231 | tmp = -127 if tmp < -127 else tmp 232 | tmp = 256 + tmp if tmp < 0 else tmp 233 | offset_pit[idx] = tmp 234 | offset_pit[offset_pit > 255] = 255 235 | offset_pit[offset_pit < 0] = 0 236 | # start = 0 237 | # for num in labels_frames: 238 | # print("---------------------------------------------") 239 | # print(scores_pit[start:start+num]) 240 | # print(featur_pit[start:start+num]) 241 | # 
print(offset_pit[start:start+num]) 242 | # start += num 243 | return offset_pit 244 | 245 | 246 | if __name__ == "__main__": 247 | output_path = "../VISinger_ofuton_data/label_vits_phn/" 248 | wav_path = "../VISinger_ofuton_data/wav_dump_24k/" 249 | logging.basicConfig(level=logging.INFO) # ERROR & INFO 250 | pitch_norm = True 251 | pitch_intp = True 252 | uv_process = False 253 | 254 | notemaper = load_midi_map() 255 | logging.info(notemaper) 256 | 257 | sample_rate = 24000 258 | hop_size = 256 259 | singInput = SingInput(sample_rate, hop_size) 260 | featureInput = FeatureInput(wav_path, sample_rate, hop_size) 261 | 262 | if not os.path.exists(output_path): 263 | os.mkdir(output_path) 264 | 265 | fo = open("../VISinger_ofuton_data/transcriptions.txt", "r+") 266 | # vits_file = open("./filelists/vits_file_phn.txt", "w", encoding="utf-8") 267 | vits_file = open("./filelists/vits_file.txt", "w", encoding="utf-8") 268 | i = 0 269 | all_txt = [] # 统计非重复的句子个数 270 | while True: 271 | try: 272 | message = fo.readline().strip() 273 | except Exception as e: 274 | print("nothing of except:", e) 275 | break 276 | if message == None: 277 | break 278 | if message == "": 279 | break 280 | i = i + 1 281 | # if i > 5: 282 | # exit() 283 | infos = message.split("|") 284 | file = infos[0] 285 | hanz = infos[1] 286 | all_txt.append(hanz) 287 | phon = infos[2].split(" ") 288 | note = infos[3].split(" ") 289 | note_dur = infos[4].split(" ") 290 | phon_dur = infos[5].split(" ") 291 | phon_slur = infos[6].split(" ") 292 | 293 | logging.info("----------------------------") 294 | logging.info("file {}".format(file)) 295 | logging.info("lyrics {}".format(hanz)) 296 | logging.info("phn {}".format(phon)) 297 | # logging.info(note_dur) 298 | # logging.info(phon_dur) 299 | # logging.info(phon_slur) 300 | 301 | ( 302 | file, 303 | labels_ids, 304 | labels_dur, 305 | scores_ids, 306 | scores_dur, 307 | labels_slr, 308 | # labels_uvs, 309 | ) = singInput.parseInput(message) 310 | # labels_ids = singInput.expandInput(labels_ids, labels_frames) 311 | # labels_uvs = singInput.expandInput(labels_uvs, labels_frames) 312 | # labels_slr = singInput.expandInput(labels_slr, labels_frames) 313 | # scores_ids = singInput.expandInput(scores_ids, labels_frames) 314 | # scores_pit = singInput.scorePitch(scores_ids) 315 | featur_pit = featureInput.compute_f0(f"{file}.wav") 316 | wav_file = os.path.join(wav_path, file + ".wav") 317 | 318 | spec_len = singInput.align_process(wav_file, labels_dur) 319 | 320 | # extend uv 321 | # labels_uvs = np.repeat(labels_uvs, labels_dur, axis=0) 322 | 323 | featur_pit = featur_pit[:spec_len] 324 | 325 | if featur_pit.shape[0] < spec_len: 326 | pad_length = spec_len - featur_pit.shape[0] 327 | featur_pit = np.pad(featur_pit, pad_width=(0, pad_length), mode="constant") 328 | assert featur_pit.shape[0] == spec_len 329 | # if uv_process: 330 | # featur_pit = featur_pit * labels_uvs 331 | coarse_pit = featureInput.coarse_f0(featur_pit) 332 | 333 | # log f0 334 | if not pitch_norm: 335 | nonzero_idxs = np.where(featur_pit != 0)[0] 336 | featur_pit[nonzero_idxs] = np.log(featur_pit[nonzero_idxs]) 337 | else: 338 | featur_pit = 2595.0 * np.log10(1.0 + featur_pit / 700.0) / 500 339 | 340 | if pitch_intp: 341 | uv = featur_pit == 0 342 | featur_pit_intp = np.copy(featur_pit) 343 | featur_pit_intp[uv] = np.interp( 344 | np.where(uv)[0], np.where(~uv)[0], featur_pit[~uv] 345 | ) 346 | 347 | # offset_pit = featureInput.diff_f0(scores_pit, featur_pit, labels_frames) 348 | # assert len(labels_ids) == len(coarse_pit) 
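        # The asserts below pin down the phoneme-level contract: label ids,
        # label durations, score ids, score durations and slurs each carry
        # exactly one entry per phoneme, while featur_pit / featur_pit_intp
        # are frame-level arrays of length spec_len.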
349 | assert len(labels_ids) == len(labels_dur) 350 | assert len(labels_dur) == len(scores_ids) 351 | assert len(scores_ids) == len(scores_dur) 352 | assert len(scores_dur) == len(labels_slr) 353 | 354 | logging.info("labels_ids {}".format(labels_ids)) 355 | # logging.info("labels_dur {}".format(labels_dur)) 356 | # logging.info("scores_ids {}".format(scores_ids)) 357 | # logging.info("scores_dur {}".format(scores_dur)) 358 | # logging.info("labels_slr {}".format(labels_slr)) 359 | # logging.info("labels_uvs {}".format(labels_uvs)) 360 | # logging.info("featur_pit {}".format(featur_pit)) 361 | logging.info("featur_pit_intp {}".format(featur_pit_intp)) 362 | 363 | np.save( 364 | output_path + f"{file}_label.npy", 365 | labels_ids, 366 | allow_pickle=False, 367 | ) 368 | np.save( 369 | output_path + f"{file}_label_dur.npy", 370 | labels_dur, 371 | allow_pickle=False, 372 | ) 373 | np.save( 374 | output_path + f"{file}_score.npy", 375 | scores_ids, 376 | allow_pickle=False, 377 | ) 378 | np.save( 379 | output_path + f"{file}_score_dur.npy", 380 | scores_dur, 381 | allow_pickle=False, 382 | ) 383 | if not pitch_intp: 384 | np.save( 385 | output_path + f"{file}_pitch.npy", 386 | featur_pit, 387 | allow_pickle=False, 388 | ) 389 | else: 390 | np.save( 391 | output_path + f"{file}_pitch.npy", 392 | featur_pit_intp, 393 | allow_pickle=False, 394 | ) 395 | # np.save( 396 | # output_path + f"{file}_pitch.npy", 397 | # coarse_pit, 398 | # allow_pickle=False, 399 | # ) 400 | np.save( 401 | output_path + f"{file}_slurs.npy", 402 | labels_slr, 403 | allow_pickle=False, 404 | ) 405 | 406 | # wave path|label path|label frame|score path|score duration;上面是一个.(当前目录),下面是两个..(从子目录调用) 407 | path_wave = wav_path + f"{file}.wav" 408 | path_label = output_path + f"{file}_label.npy" 409 | path_label_dur = output_path + f"{file}_label_dur.npy" 410 | path_score = output_path + f"{file}_score.npy" 411 | path_score_dur = output_path + f"{file}_score_dur.npy" 412 | path_pitch = output_path + f"{file}_pitch.npy" 413 | path_slurs = output_path + f"{file}_slurs.npy" 414 | print( 415 | f"{path_wave}|{path_label}|{path_label_dur}|{path_score}|{path_score_dur}|{path_pitch}|{path_slurs}", 416 | file=vits_file, 417 | ) 418 | 419 | fo.close() 420 | vits_file.close() 421 | print(len(set(all_txt))) # 统计非重复的句子个数 422 | -------------------------------------------------------------------------------- /prepare/dur_to_frame.py: -------------------------------------------------------------------------------- 1 | def dur_to_frame(ds, fs, hop_size): 2 | frames = [int(i * fs / hop_size + 0.5) for i in ds] 3 | return frames 4 | -------------------------------------------------------------------------------- /prepare/gen_ofuton_transcript.py: -------------------------------------------------------------------------------- 1 | import music21 as m21 2 | import os 3 | from typing import Iterable, List, Optional, Union 4 | 5 | 6 | def pyopenjtalk_g2p(text) -> List[str]: 7 | import pyopenjtalk 8 | 9 | # phones is a str object separated by space 10 | phones = pyopenjtalk.g2p(text, kana=False) 11 | phones = phones.split(" ") 12 | return phones 13 | 14 | 15 | def text2tokens_svs(syllable: str) -> List[str]: 16 | customed_dic = { 17 | "へ": ["h", "e"], 18 | "ヴぁ": ["v", "a"], 19 | "ヴぃ": ["v", "i"], 20 | "ヴぇ": ["v", "e"], 21 | "ヴぉ": ["v", "i"], 22 | "でぇ": ["dy", "e"], 23 | } 24 | tokens = pyopenjtalk_g2p(syllable) 25 | if syllable in customed_dic: 26 | tokens = customed_dic[syllable] 27 | return tokens 28 | 29 | 30 | def note_filter(note_name, note_map): 
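    # Rewrites a music21 note name into the midi-note.scp key format: any
    # "-" (flat marker) is stripped, and a sharp gains its enharmonic twin,
    # e.g. note_filter("C#4", note_map) -> "C#4/Db4", since note_map["C"]
    # supplies the flat letter "D".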
31 | note_name = note_name.replace("-", "") 32 | if "#" in note_name: 33 | note_name = note_name + "/" + note_map[note_name[0]] + "b" + note_name[2] 34 | return note_name 35 | 36 | 37 | # eval(valid), dev(test), train 38 | def process(base_path, file_path): 39 | 40 | note_map = { 41 | "A": "B", 42 | "B": "C", 43 | "C": "D", 44 | "D": "E", 45 | "E": "F", 46 | "F": "G", 47 | "G": "A", 48 | } 49 | 50 | label_path = file_path + "label" 51 | text_path = file_path + "text" 52 | data = [] 53 | 54 | for line in open(label_path, "r"): 55 | # add phn and phn_dur 56 | str_list = line.replace("\n", "").split(" ") 57 | name = str_list[0] 58 | phn_dur = [] 59 | phn = [] 60 | score = [] 61 | score_dur = [] 62 | 63 | for i in range(1, len(str_list)): 64 | try: 65 | phn_dur_ = str(round(float(str_list[i + 1]) - float(str_list[i]), 6)) 66 | # phn_dur_ = float(str_list[i + 1]) - float(str_list[i]) 67 | except: 68 | if str_list[i] != "" and str_list[i].isalpha(): 69 | phn.append(str_list[i]) 70 | phn_dict.add(str_list[i]) 71 | continue 72 | 73 | phn_dur.append(phn_dur_) 74 | 75 | # append text 76 | for line2 in open(text_path, "r"): 77 | str_list2 = line2.replace("\n", "").split(" ") 78 | if str_list2[0] != name: 79 | continue 80 | else: 81 | text_ = str_list2[1] 82 | break 83 | 84 | # add score and score_dur 85 | musicxmlscp = open(os.path.join(file_path, "xml.scp"), "r", encoding="utf-8") 86 | for xml_line in musicxmlscp: 87 | xmlline = xml_line.strip().split(" ") 88 | recording_id = xmlline[0] 89 | if recording_id != name: 90 | continue 91 | else: 92 | path = base_path + xmlline[1] 93 | parse_file = m21.converter.parse(path) 94 | part = parse_file.parts[0].flat 95 | m = parse_file.metronomeMarkBoundaries() 96 | tempo = m[0][2] 97 | for part in parse_file.parts: 98 | for note in part.recurse().notes: 99 | note_dur_ = note.quarterLength * 60 / tempo.number 100 | note_name_ = note_filter(note.nameWithOctave, note_map) 101 | note_text_ = note.lyric 102 | # print("note_text1", text_) 103 | # print("note_text_", note_text_) 104 | if not note_text_: 105 | continue 106 | note_phn_ = text2tokens_svs(note_text_) 107 | for i in range(len(note_phn_)): 108 | score.append(note_name_) 109 | score_dur.append(str(note_dur_)) 110 | # print("note_phn", note_phn_) 111 | break 112 | 113 | # print("tempo", tempo.number) 114 | 115 | # TODO: add slur. 
currently all 0 116 | slur = [] 117 | for i in range(len(phn)): 118 | slur.append("0") 119 | 120 | # add one line 121 | data.append( 122 | name 123 | + "|" 124 | + text_ 125 | + "|" 126 | + " ".join(phn) 127 | + "|" 128 | + " ".join(score) 129 | + "|" 130 | + " ".join(score_dur) 131 | + "|" 132 | + " ".join(phn_dur) 133 | + "|" 134 | + " ".join(slur) 135 | ) 136 | print(data) 137 | assert len(phn) == len(phn_dur) 138 | assert len(phn) == len(score) 139 | assert len(phn) == len(score_dur) 140 | assert len(phn) == len(slur) 141 | return data 142 | 143 | 144 | base_path = "/home/yyu479/espnet/egs2/ofuton_p_utagoe_db/svs1/" 145 | 146 | data = [] 147 | phn_dict = set() 148 | data_eval = process(base_path, base_path + "dump/raw/eval/") 149 | data_dev = process(base_path, base_path + "dump/raw/org/dev/") 150 | data_tr_no_dev = process(base_path, base_path + "dump/raw/org/tr_no_dev/") 151 | data = data_eval + data_dev + data_tr_no_dev 152 | 153 | with open("transcriptions.txt", "w") as f: 154 | for i in data: 155 | f.writelines(i) 156 | f.write("\n") 157 | 158 | phn_dict_sort = list(phn_dict) 159 | phn_dict_sort.sort() 160 | with open("dict.txt", "w") as f: 161 | for i in phn_dict_sort: 162 | f.writelines(i) 163 | f.write("\n") 164 | -------------------------------------------------------------------------------- /prepare/midi-HZ.scp: -------------------------------------------------------------------------------- 1 | 127 G9 12543.9 2 | 126 F#9/Gb9 11839.8 3 | 125 F9 11175.3 4 | 124 E9 10548.1 5 | 123 D#9/Eb9 9956.1 6 | 122 D9 9397.3 7 | 121 C#9/Db9 8869.8 8 | 120 C9 8372 9 | 119 B8 7902.1 10 | 118 A#8/Bb8 7458.6 11 | 117 A8 7040 12 | 116 G#8/Ab8 6644.9 13 | 115 G8 6271.9 14 | 114 F#8/Gb8 5919.9 15 | 113 F8 5587.7 16 | 112 E8 5274 17 | 111 D#8/Eb8 4978 18 | 110 D8 4698.6 19 | 109 C#8/Db8 4434.9 20 | 108 C8 4186 21 | 107 B7 3951.1 22 | 106 A#7/Bb7 3729.3 23 | 105 A7 3520 24 | 104 G#7/Ab7 3322.4 25 | 103 G7 3136 26 | 102 F#7/Gb7 2960 27 | 101 F7 2793.8 28 | 100 E7 2637 29 | 99 D#7/Eb7 2489 30 | 98 D7 2349.3 31 | 97 C#7/Db7 2217.5 32 | 96 C7 2093 33 | 95 B6 1975.5 34 | 94 A#6/Bb6 1864.7 35 | 93 A6 1760 36 | 92 G#6/Ab6 1661.2 37 | 91 G6 1568 38 | 90 F#6/Gb6 1480 39 | 89 F6 1396.9 40 | 88 E6 1318.5 41 | 87 D#6/Eb6 1244.5 42 | 86 D6 1174.7 43 | 85 C#6/Db6 1108.7 44 | 84 C6 1046.5 45 | 83 B5 987.8 46 | 82 A#5/Bb5 932.3 47 | 81 A5 880 48 | 80 G#5/Ab5 830.6 49 | 79 G5 784 50 | 78 F#5/Gb5 740 51 | 77 F5 698.5 52 | 76 E5 659.3 53 | 75 D#5/Eb5 622.3 54 | 74 D5 587.3 55 | 73 C#5/Db5 554.4 56 | 72 C5 523.3 57 | 71 B4 493.9 58 | 70 A#4/Bb4 466.2 59 | 69 A4 440 60 | 68 G#4/Ab4 415.3 61 | 67 G4 392 62 | 66 F#4/Gb4 370 63 | 65 F4 349.2 64 | 64 E4 329.6 65 | 63 D#4/Eb4 311.1 66 | 62 D4 293.7 67 | 61 C#4/Db4 277.2 68 | 60 C4 261.6 69 | 59 B3 246.9 70 | 58 A#3/Bb3 233.1 71 | 57 A3 220 72 | 56 G#3/Ab3 207.7 73 | 55 G3 196 74 | 54 F#3/Gb3 185 75 | 53 F3 174.6 76 | 52 E3 164.8 77 | 51 D#3/Eb3 155.6 78 | 50 D3 146.8 79 | 49 C#3/Db3 138.6 80 | 48 C3 130.8 81 | 47 B2 123.5 82 | 46 A#2/Bb2 116.5 83 | 45 A2 110 84 | 44 G#2/Ab2 103. 
85 | 43 G2 98 86 | 42 F#2/Gb2 92.5 87 | 41 F2 87.3 88 | 40 E2 82.4 89 | 39 D#2/Eb2 77.8 90 | 38 D2 73.4 91 | 37 C#2/Db2 69.3 92 | 36 C2 65.4 93 | 35 B1 61.7 94 | 34 A#1/Bb1 58.3 95 | 33 A1 55 96 | 32 G#1/Ab1 51.9 97 | 31 G1 49 98 | 30 F#1/Gb1 46.2 99 | 29 F1 43.7 100 | 28 E1 41.2 101 | 27 D#1/Eb1 38.9 102 | 26 D1 36.7 103 | 25 C#1/Db1 34.6 104 | 24 C1 32.7 105 | 23 B0 30.9 106 | 22 A#0/Bb0 29.1 107 | 21 A0 27.5 108 | 0 rest 0 -------------------------------------------------------------------------------- /prepare/midi-note.scp: -------------------------------------------------------------------------------- 1 | 127 G9 2 | 126 F#9/Gb9 3 | 125 F9 4 | 124 E9 5 | 123 D#9/Eb9 6 | 122 D9 7 | 121 C#9/Db9 8 | 120 C9 9 | 119 B8 10 | 118 A#8/Bb8 11 | 117 A8 12 | 116 G#8/Ab8 13 | 115 G8 14 | 114 F#8/Gb8 15 | 113 F8 16 | 112 E8 17 | 111 D#8/Eb8 18 | 110 D8 19 | 109 C#8/Db8 20 | 108 C8 21 | 107 B7 22 | 106 A#7/Bb7 23 | 105 A7 24 | 104 G#7/Ab7 25 | 103 G7 26 | 102 F#7/Gb7 27 | 101 F7 28 | 100 E7 29 | 99 D#7/Eb7 30 | 98 D7 31 | 97 C#7/Db7 32 | 96 C7 33 | 95 B6 34 | 94 A#6/Bb6 35 | 93 A6 36 | 92 G#6/Ab6 37 | 91 G6 38 | 90 F#6/Gb6 39 | 89 F6 40 | 88 E6 41 | 87 D#6/Eb6 42 | 86 D6 43 | 85 C#6/Db6 44 | 84 C6 45 | 83 B5 46 | 82 A#5/Bb5 47 | 81 A5 48 | 80 G#5/Ab5 49 | 79 G5 50 | 78 F#5/Gb5 51 | 77 F5 52 | 76 E5 53 | 75 D#5/Eb5 54 | 74 D5 55 | 73 C#5/Db5 56 | 72 C5 57 | 71 B4 58 | 70 A#4/Bb4 59 | 69 A4 60 | 68 G#4/Ab4 61 | 67 G4 62 | 66 F#4/Gb4 63 | 65 F4 64 | 64 E4 65 | 63 D#4/Eb4 66 | 62 D4 67 | 61 C#4/Db4 68 | 60 C4 69 | 59 B3 70 | 58 A#3/Bb3 71 | 57 A3 72 | 56 G#3/Ab3 73 | 55 G3 74 | 54 F#3/Gb3 75 | 53 F3 76 | 52 E3 77 | 51 D#3/Eb3 78 | 50 D3 79 | 49 C#3/Db3 80 | 48 C3 81 | 47 B2 82 | 46 A#2/Bb2 83 | 45 A2 84 | 44 G#2/Ab2 85 | 43 G2 86 | 42 F#2/Gb2 87 | 41 F2 88 | 40 E2 89 | 39 D#2/Eb2 90 | 38 D2 91 | 37 C#2/Db2 92 | 36 C2 93 | 35 B1 94 | 34 A#1/Bb1 95 | 33 A1 96 | 32 G#1/Ab1 97 | 31 G1 98 | 30 F#1/Gb1 99 | 29 F1 100 | 28 E1 101 | 27 D#1/Eb1 102 | 26 D1 103 | 25 C#1/Db1 104 | 24 C1 105 | 23 B0 106 | 22 A#0/Bb0 107 | 21 A0 -------------------------------------------------------------------------------- /prepare/phone_map.py: -------------------------------------------------------------------------------- 1 | _pause = ["unk", "sos", "eos", "ap", "sp"] 2 | 3 | _initials = [ 4 | "b", 5 | "c", 6 | "ch", 7 | "d", 8 | "f", 9 | "g", 10 | "h", 11 | "j", 12 | "k", 13 | "l", 14 | "m", 15 | "n", 16 | "p", 17 | "q", 18 | "r", 19 | "s", 20 | "sh", 21 | "t", 22 | "w", 23 | "x", 24 | "y", 25 | "z", 26 | "zh", 27 | ] 28 | 29 | _finals = [ 30 | "a", 31 | "ai", 32 | "an", 33 | "ang", 34 | "ao", 35 | "e", 36 | "ei", 37 | "en", 38 | "eng", 39 | "er", 40 | "i", 41 | "ia", 42 | "ian", 43 | "iang", 44 | "iao", 45 | "ie", 46 | "in", 47 | "ing", 48 | "iong", 49 | "iu", 50 | "o", 51 | "ong", 52 | "ou", 53 | "u", 54 | "ua", 55 | "uai", 56 | "uan", 57 | "uang", 58 | "ui", 59 | "un", 60 | "uo", 61 | "v", 62 | "van", 63 | "ve", 64 | "vn", 65 | ] 66 | 67 | lang = "cn" 68 | if lang == "cn": 69 | symbols = _pause + _initials + _finals 70 | elif lang == "jp": 71 | symbols = [ 72 | "I", 73 | "N", 74 | "a", 75 | "b", 76 | "by", 77 | "ch", 78 | "cl", 79 | "d", 80 | "dy", 81 | "e", 82 | "f", 83 | "g", 84 | "gy", 85 | "h", 86 | "hy", 87 | "i", 88 | "j", 89 | "k", 90 | "ky", 91 | "m", 92 | "my", 93 | "n", 94 | "ny", 95 | "o", 96 | "p", 97 | "py", 98 | "r", 99 | "ry", 100 | "s", 101 | "sh", 102 | "t", 103 | "ts", 104 | "ty", 105 | "u", 106 | "v", 107 | "w", 108 | "y", 109 | "z", 110 | ] 111 | 112 | # Mappings from symbol to numeric ID and vice 
versa:
113 | _symbol_to_id = {s: i for i, s in enumerate(symbols)}
114 | _id_to_symbol = {i: s for i, s in enumerate(symbols)}
115 |
116 |
117 | def label_to_ids(phones):
118 |     # use lower letter
119 |     if lang == "cn":
120 |         sequence = [_symbol_to_id[symbol.lower()] for symbol in phones]
121 |     elif lang == "jp":
122 |         sequence = [_symbol_to_id[symbol] for symbol in phones]
123 |     return sequence
124 |
125 |
126 | def get_vocab_size():
127 |     return len(symbols)
128 |
--------------------------------------------------------------------------------
/prepare/phone_uv.py:
--------------------------------------------------------------------------------
1 | # Basic Mandarin initials and finals
2 | # Mandarin has only 4 voiced initials: m, n, l, r; the other 17 consonant initials are voiceless
3 | # In pinyin, y and w appear only at the start of zero-initial syllables; their main role is to make syllable boundaries clear.
4 | # https://baijiahao.baidu.com/s?id=1655739561730224990&wfr=spider&for=pc
5 |
6 | uv_map = {
7 |     "unk":0,
8 |     "sos":0,
9 |     "eos":0,
10 |     "ap":0,
11 |     "sp":0,
12 |     "b":0,
13 |     "c":0,
14 |     "ch":0,
15 |     "d":0,
16 |     "f":0,
17 |     "g":0,
18 |     "h":0,
19 |     "j":0,
20 |     "k":0,
21 |     "l":1,
22 |     "m":1,
23 |     "n":1,
24 |     "p":0,
25 |     "q":0,
26 |     "r":1,
27 |     "s":0,
28 |     "sh":0,
29 |     "t":0,
30 |     "w":1,
31 |     "x":0,
32 |     "y":1,
33 |     "z":0,
34 |     "zh":0,
35 |     "a":1,
36 |     "ai":1,
37 |     "an":1,
38 |     "ang":1,
39 |     "ao":1,
40 |     "e":1,
41 |     "ei":1,
42 |     "en":1,
43 |     "eng":1,
44 |     "er":1,
45 |     "i":1,
46 |     "ia":1,
47 |     "ian":1,
48 |     "iang":1,
49 |     "iao":1,
50 |     "ie":1,
51 |     "in":1,
52 |     "ing":1,
53 |     "iong":1,
54 |     "iu":1,
55 |     "o":1,
56 |     "ong":1,
57 |     "ou":1,
58 |     "u":1,
59 |     "ua":1,
60 |     "uai":1,
61 |     "uan":1,
62 |     "uang":1,
63 |     "ui":1,
64 |     "un":1,
65 |     "uo":1,
66 |     "v":1,
67 |     "van":1,
68 |     "ve":1,
69 |     "vn":1
70 | }
--------------------------------------------------------------------------------
/prepare/preprocess.py:
--------------------------------------------------------------------------------
1 | import random
2 |
3 | if __name__ == "__main__":
4 |
5 |     alls = []
6 |     fo = open("./filelists/vits_file.txt", "r+")
7 |     while True:
8 |         try:
9 |             message = fo.readline().strip()
10 |         except Exception as e:
11 |             print("nothing of except:", e)
12 |             break
13 |         if message is None:
14 |             break
15 |         if message == "":
16 |             break
17 |         alls.append(message)
18 |     fo.close()
19 |
20 |     valids = alls[:150]
21 |     tests = alls[150:300]
22 |     trains = alls[300:]
23 |
24 |     random.shuffle(trains)
25 |
26 |     fw = open("./filelists/singing_valid.txt", "w", encoding="utf-8")
27 |     for strs in valids:
28 |         print(strs, file=fw)
29 |     fw.close()
30 |
31 |     fw = open("./filelists/singing_test.txt", "w", encoding="utf-8")
32 |     for strs in tests:
33 |         print(strs, file=fw)
34 |
35 |     fw = open("./filelists/singing_train.txt", "w", encoding="utf-8")
36 |     for strs in trains:
37 |         print(strs, file=fw)
38 |
39 |     fw.close()
40 |
--------------------------------------------------------------------------------
/prepare/preprocess_jp.py:
--------------------------------------------------------------------------------
1 | import random
2 |
3 | if __name__ == "__main__":
4 |
5 |     alls = []
6 |     fo = open("./filelists/vits_file.txt", "r+")
7 |     while True:
8 |         try:
9 |             message = fo.readline().strip()
10 |         except Exception as e:
11 |             print("nothing of except:", e)
12 |             break
13 |         if message is None:
14 |             break
15 |         if message == "":
16 |             break
17 |         alls.append(message)
18 |     fo.close()
19 |
20 |     valids = alls[:70]
21 |     tests = alls[70:134]
22 |     trains = alls[134:]
23 |
24 |     random.shuffle(trains)
25 |
26 |     fw = open("./filelists/singing_valid.txt", "w", encoding="utf-8")
27 |     for strs in valids:
28 |         print(strs, file=fw)
29 |     fw.close()
30 |
31 |     fw = open("./filelists/singing_test.txt", "w", encoding="utf-8")
32 |     for strs in tests:
33 |         print(strs, file=fw)
34 |
35 |     fw = open("./filelists/singing_train.txt", "w", encoding="utf-8")
36 |     for strs in trains:
37 |         print(strs, file=fw)
38 |
39 |     fw.close()
40 |
--------------------------------------------------------------------------------
/prepare/resample_wav.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import os
3 |
4 |
5 | def process_utterance(
6 |     audio_dir,
7 |     wav_dumpdir,
8 |     segment,
9 |     tgt_sr=24000,
10 | ):
11 |     uid, lyrics, phns, notes, syb_dur, phn_dur, keep = segment.strip().split("|")
12 |     cmd = "sox {}.wav -c 1 -t wavpcm -b 16 -r {} {}_bits16.wav".format(
13 |         os.path.join(audio_dir, uid),
14 |         tgt_sr,
15 |         os.path.join(wav_dumpdir, uid),
16 |     )
17 |     print("uid", uid)
18 |     os.system(cmd)
19 |
20 |
21 | def process_subset(args, set_name):
22 |     with open(
23 |         os.path.join(args.src_data, "segments", set_name + ".txt"),
24 |         "r",
25 |         encoding="utf-8",
26 |     ) as f:
27 |         segments = f.read().strip().split("\n")
28 |     for segment in segments:
29 |         process_utterance(
30 |             os.path.join(args.src_data, "segments", "wavs"),
31 |             args.wav_dumpdir,
32 |             segment,
33 |             tgt_sr=args.sr,
34 |         )
35 |
36 |
37 | if __name__ == "__main__":
38 |     parser = argparse.ArgumentParser(description="Prepare Data for Opencpop Database")
39 |     parser.add_argument("src_data", type=str, help="source data directory")
40 |     parser.add_argument(
41 |         "--wav_dumpdir", type=str, help="wav dump directory (rebit)", default="wav_dump"
42 |     )
43 |     parser.add_argument("--sr", type=int, help="sampling rate (Hz)")
44 |     args = parser.parse_args()
45 |
46 |     for name in ["train", "test"]:
47 |         process_subset(args, name)
48 |
--------------------------------------------------------------------------------
/prepare/resample_wav.sh:
--------------------------------------------------------------------------------
1 | OPENCPOP=/home/yyu479/svs/data/Opencpop/
2 | fs=24000
3 | output=/home/yyu479/VISinger_data/wav_dump_24k
4 | mkdir -p ${output}
5 | python resample_wav.py ${OPENCPOP} \
6 |     --wav_dumpdir ${output} \
7 |     --sr ${fs}
--------------------------------------------------------------------------------
/resource/2005000151.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jerryuhoo/VISinger/ad8bc167c10275dd513ae466e73deae2f7045c99/resource/2005000151.wav
--------------------------------------------------------------------------------
/resource/2005000152.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jerryuhoo/VISinger/ad8bc167c10275dd513ae466e73deae2f7045c99/resource/2005000152.wav
--------------------------------------------------------------------------------
/resource/2006000186.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jerryuhoo/VISinger/ad8bc167c10275dd513ae466e73deae2f7045c99/resource/2006000186.wav
--------------------------------------------------------------------------------
/resource/2006000187.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jerryuhoo/VISinger/ad8bc167c10275dd513ae466e73deae2f7045c99/resource/2006000187.wav
--------------------------------------------------------------------------------
/resource/2008000268.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jerryuhoo/VISinger/ad8bc167c10275dd513ae466e73deae2f7045c99/resource/2008000268.wav -------------------------------------------------------------------------------- /resource/vising_loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jerryuhoo/VISinger/ad8bc167c10275dd513ae466e73deae2f7045c99/resource/vising_loss.png -------------------------------------------------------------------------------- /resource/vising_mel.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jerryuhoo/VISinger/ad8bc167c10275dd513ae466e73deae2f7045c99/resource/vising_mel.png -------------------------------------------------------------------------------- /train.sh: -------------------------------------------------------------------------------- 1 | nohup python train.py -c configs/singing_base.json -m singing_base & -------------------------------------------------------------------------------- /transforms.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.nn import functional as F 3 | 4 | import numpy as np 5 | 6 | 7 | DEFAULT_MIN_BIN_WIDTH = 1e-3 8 | DEFAULT_MIN_BIN_HEIGHT = 1e-3 9 | DEFAULT_MIN_DERIVATIVE = 1e-3 10 | 11 | 12 | def piecewise_rational_quadratic_transform(inputs, 13 | unnormalized_widths, 14 | unnormalized_heights, 15 | unnormalized_derivatives, 16 | inverse=False, 17 | tails=None, 18 | tail_bound=1., 19 | min_bin_width=DEFAULT_MIN_BIN_WIDTH, 20 | min_bin_height=DEFAULT_MIN_BIN_HEIGHT, 21 | min_derivative=DEFAULT_MIN_DERIVATIVE): 22 | 23 | if tails is None: 24 | spline_fn = rational_quadratic_spline 25 | spline_kwargs = {} 26 | else: 27 | spline_fn = unconstrained_rational_quadratic_spline 28 | spline_kwargs = { 29 | 'tails': tails, 30 | 'tail_bound': tail_bound 31 | } 32 | 33 | outputs, logabsdet = spline_fn( 34 | inputs=inputs, 35 | unnormalized_widths=unnormalized_widths, 36 | unnormalized_heights=unnormalized_heights, 37 | unnormalized_derivatives=unnormalized_derivatives, 38 | inverse=inverse, 39 | min_bin_width=min_bin_width, 40 | min_bin_height=min_bin_height, 41 | min_derivative=min_derivative, 42 | **spline_kwargs 43 | ) 44 | return outputs, logabsdet 45 | 46 | 47 | def searchsorted(bin_locations, inputs, eps=1e-6): 48 | bin_locations[..., -1] += eps 49 | return torch.sum( 50 | inputs[..., None] >= bin_locations, 51 | dim=-1 52 | ) - 1 53 | 54 | 55 | def unconstrained_rational_quadratic_spline(inputs, 56 | unnormalized_widths, 57 | unnormalized_heights, 58 | unnormalized_derivatives, 59 | inverse=False, 60 | tails='linear', 61 | tail_bound=1., 62 | min_bin_width=DEFAULT_MIN_BIN_WIDTH, 63 | min_bin_height=DEFAULT_MIN_BIN_HEIGHT, 64 | min_derivative=DEFAULT_MIN_DERIVATIVE): 65 | inside_interval_mask = (inputs >= -tail_bound) & (inputs <= tail_bound) 66 | outside_interval_mask = ~inside_interval_mask 67 | 68 | outputs = torch.zeros_like(inputs) 69 | logabsdet = torch.zeros_like(inputs) 70 | 71 | if tails == 'linear': 72 | unnormalized_derivatives = F.pad(unnormalized_derivatives, pad=(1, 1)) 73 | constant = np.log(np.exp(1 - min_derivative) - 1) 74 | unnormalized_derivatives[..., 0] = constant 75 | unnormalized_derivatives[..., -1] = constant 76 | 77 | outputs[outside_interval_mask] = inputs[outside_interval_mask] 78 | 
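        # Outside [-tail_bound, tail_bound] the 'linear' tails act as the
        # identity map, so the Jacobian there is 1 and log|det J| = 0 --
        # which is exactly what the next assignment records for the
        # outside-interval entries.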
logabsdet[outside_interval_mask] = 0 79 | else: 80 | raise RuntimeError('{} tails are not implemented.'.format(tails)) 81 | 82 | outputs[inside_interval_mask], logabsdet[inside_interval_mask] = rational_quadratic_spline( 83 | inputs=inputs[inside_interval_mask], 84 | unnormalized_widths=unnormalized_widths[inside_interval_mask, :], 85 | unnormalized_heights=unnormalized_heights[inside_interval_mask, :], 86 | unnormalized_derivatives=unnormalized_derivatives[inside_interval_mask, :], 87 | inverse=inverse, 88 | left=-tail_bound, right=tail_bound, bottom=-tail_bound, top=tail_bound, 89 | min_bin_width=min_bin_width, 90 | min_bin_height=min_bin_height, 91 | min_derivative=min_derivative 92 | ) 93 | 94 | return outputs, logabsdet 95 | 96 | def rational_quadratic_spline(inputs, 97 | unnormalized_widths, 98 | unnormalized_heights, 99 | unnormalized_derivatives, 100 | inverse=False, 101 | left=0., right=1., bottom=0., top=1., 102 | min_bin_width=DEFAULT_MIN_BIN_WIDTH, 103 | min_bin_height=DEFAULT_MIN_BIN_HEIGHT, 104 | min_derivative=DEFAULT_MIN_DERIVATIVE): 105 | if torch.min(inputs) < left or torch.max(inputs) > right: 106 | raise ValueError('Input to a transform is not within its domain') 107 | 108 | num_bins = unnormalized_widths.shape[-1] 109 | 110 | if min_bin_width * num_bins > 1.0: 111 | raise ValueError('Minimal bin width too large for the number of bins') 112 | if min_bin_height * num_bins > 1.0: 113 | raise ValueError('Minimal bin height too large for the number of bins') 114 | 115 | widths = F.softmax(unnormalized_widths, dim=-1) 116 | widths = min_bin_width + (1 - min_bin_width * num_bins) * widths 117 | cumwidths = torch.cumsum(widths, dim=-1) 118 | cumwidths = F.pad(cumwidths, pad=(1, 0), mode='constant', value=0.0) 119 | cumwidths = (right - left) * cumwidths + left 120 | cumwidths[..., 0] = left 121 | cumwidths[..., -1] = right 122 | widths = cumwidths[..., 1:] - cumwidths[..., :-1] 123 | 124 | derivatives = min_derivative + F.softplus(unnormalized_derivatives) 125 | 126 | heights = F.softmax(unnormalized_heights, dim=-1) 127 | heights = min_bin_height + (1 - min_bin_height * num_bins) * heights 128 | cumheights = torch.cumsum(heights, dim=-1) 129 | cumheights = F.pad(cumheights, pad=(1, 0), mode='constant', value=0.0) 130 | cumheights = (top - bottom) * cumheights + bottom 131 | cumheights[..., 0] = bottom 132 | cumheights[..., -1] = top 133 | heights = cumheights[..., 1:] - cumheights[..., :-1] 134 | 135 | if inverse: 136 | bin_idx = searchsorted(cumheights, inputs)[..., None] 137 | else: 138 | bin_idx = searchsorted(cumwidths, inputs)[..., None] 139 | 140 | input_cumwidths = cumwidths.gather(-1, bin_idx)[..., 0] 141 | input_bin_widths = widths.gather(-1, bin_idx)[..., 0] 142 | 143 | input_cumheights = cumheights.gather(-1, bin_idx)[..., 0] 144 | delta = heights / widths 145 | input_delta = delta.gather(-1, bin_idx)[..., 0] 146 | 147 | input_derivatives = derivatives.gather(-1, bin_idx)[..., 0] 148 | input_derivatives_plus_one = derivatives[..., 1:].gather(-1, bin_idx)[..., 0] 149 | 150 | input_heights = heights.gather(-1, bin_idx)[..., 0] 151 | 152 | if inverse: 153 | a = (((inputs - input_cumheights) * (input_derivatives 154 | + input_derivatives_plus_one 155 | - 2 * input_delta) 156 | + input_heights * (input_delta - input_derivatives))) 157 | b = (input_heights * input_derivatives 158 | - (inputs - input_cumheights) * (input_derivatives 159 | + input_derivatives_plus_one 160 | - 2 * input_delta)) 161 | c = - input_delta * (inputs - input_cumheights) 162 | 163 | discriminant = 
b.pow(2) - 4 * a * c 164 | assert (discriminant >= 0).all() 165 | 166 | root = (2 * c) / (-b - torch.sqrt(discriminant)) 167 | outputs = root * input_bin_widths + input_cumwidths 168 | 169 | theta_one_minus_theta = root * (1 - root) 170 | denominator = input_delta + ((input_derivatives + input_derivatives_plus_one - 2 * input_delta) 171 | * theta_one_minus_theta) 172 | derivative_numerator = input_delta.pow(2) * (input_derivatives_plus_one * root.pow(2) 173 | + 2 * input_delta * theta_one_minus_theta 174 | + input_derivatives * (1 - root).pow(2)) 175 | logabsdet = torch.log(derivative_numerator) - 2 * torch.log(denominator) 176 | 177 | return outputs, -logabsdet 178 | else: 179 | theta = (inputs - input_cumwidths) / input_bin_widths 180 | theta_one_minus_theta = theta * (1 - theta) 181 | 182 | numerator = input_heights * (input_delta * theta.pow(2) 183 | + input_derivatives * theta_one_minus_theta) 184 | denominator = input_delta + ((input_derivatives + input_derivatives_plus_one - 2 * input_delta) 185 | * theta_one_minus_theta) 186 | outputs = input_cumheights + numerator / denominator 187 | 188 | derivative_numerator = input_delta.pow(2) * (input_derivatives_plus_one * theta.pow(2) 189 | + 2 * input_delta * theta_one_minus_theta 190 | + input_derivatives * (1 - theta).pow(2)) 191 | logabsdet = torch.log(derivative_numerator) - 2 * torch.log(denominator) 192 | 193 | return outputs, logabsdet 194 | -------------------------------------------------------------------------------- /utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | import glob 3 | import sys 4 | import argparse 5 | import logging 6 | import json 7 | import subprocess 8 | import numpy as np 9 | from scipy.io.wavfile import read 10 | import torch 11 | 12 | MATPLOTLIB_FLAG = False 13 | 14 | logging.basicConfig(stream=sys.stdout, level=logging.DEBUG) 15 | logger = logging 16 | 17 | 18 | def load_checkpoint(checkpoint_path, model, optimizer=None): 19 | assert os.path.isfile(checkpoint_path) 20 | checkpoint_dict = torch.load(checkpoint_path, map_location='cpu') 21 | iteration = checkpoint_dict['iteration'] 22 | learning_rate = checkpoint_dict['learning_rate'] 23 | if optimizer is not None: 24 | optimizer.load_state_dict(checkpoint_dict['optimizer']) 25 | saved_state_dict = checkpoint_dict['model'] 26 | if hasattr(model, 'module'): 27 | state_dict = model.module.state_dict() 28 | else: 29 | state_dict = model.state_dict() 30 | new_state_dict= {} 31 | for k, v in state_dict.items(): 32 | try: 33 | new_state_dict[k] = saved_state_dict[k] 34 | except: 35 | logger.info("%s is not in the checkpoint" % k) 36 | new_state_dict[k] = v 37 | if hasattr(model, 'module'): 38 | model.module.load_state_dict(new_state_dict) 39 | else: 40 | model.load_state_dict(new_state_dict) 41 | logger.info("Loaded checkpoint '{}' (iteration {})" .format( 42 | checkpoint_path, iteration)) 43 | return model, optimizer, learning_rate, iteration 44 | 45 | 46 | def save_checkpoint(model, optimizer, learning_rate, iteration, checkpoint_path): 47 | logger.info("Saving model and optimizer state at iteration {} to {}".format( 48 | iteration, checkpoint_path)) 49 | if hasattr(model, 'module'): 50 | state_dict = model.module.state_dict() 51 | else: 52 | state_dict = model.state_dict() 53 | torch.save({'model': state_dict, 54 | 'iteration': iteration, 55 | 'optimizer': optimizer.state_dict(), 56 | 'learning_rate': learning_rate}, checkpoint_path) 57 | 58 | 59 | def summarize(writer, global_step, scalars={}, 
histograms={}, images={}, audios={}, audio_sampling_rate=22050): 60 | for k, v in scalars.items(): 61 | writer.add_scalar(k, v, global_step) 62 | for k, v in histograms.items(): 63 | writer.add_histogram(k, v, global_step) 64 | for k, v in images.items(): 65 | writer.add_image(k, v, global_step, dataformats='HWC') 66 | for k, v in audios.items(): 67 | writer.add_audio(k, v, global_step, audio_sampling_rate) 68 | 69 | 70 | def latest_checkpoint_path(dir_path, regex="G_*.pth"): 71 | f_list = glob.glob(os.path.join(dir_path, regex)) 72 | f_list.sort(key=lambda f: int("".join(filter(str.isdigit, f)))) 73 | x = f_list[-1] 74 | print(x) 75 | return x 76 | 77 | 78 | def plot_spectrogram_to_numpy(spectrogram): 79 | global MATPLOTLIB_FLAG 80 | if not MATPLOTLIB_FLAG: 81 | import matplotlib 82 | matplotlib.use("Agg") 83 | MATPLOTLIB_FLAG = True 84 | mpl_logger = logging.getLogger('matplotlib') 85 | mpl_logger.setLevel(logging.WARNING) 86 | import matplotlib.pylab as plt 87 | import numpy as np 88 | 89 | fig, ax = plt.subplots(figsize=(10,2)) 90 | im = ax.imshow(spectrogram, aspect="auto", origin="lower", 91 | interpolation='none') 92 | plt.colorbar(im, ax=ax) 93 | plt.xlabel("Frames") 94 | plt.ylabel("Channels") 95 | plt.tight_layout() 96 | 97 | fig.canvas.draw() 98 | data = np.fromstring(fig.canvas.tostring_rgb(), dtype=np.uint8, sep='') 99 | data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,)) 100 | plt.close() 101 | return data 102 | 103 | 104 | def plot_alignment_to_numpy(alignment, info=None): 105 | global MATPLOTLIB_FLAG 106 | if not MATPLOTLIB_FLAG: 107 | import matplotlib 108 | matplotlib.use("Agg") 109 | MATPLOTLIB_FLAG = True 110 | mpl_logger = logging.getLogger('matplotlib') 111 | mpl_logger.setLevel(logging.WARNING) 112 | import matplotlib.pylab as plt 113 | import numpy as np 114 | 115 | fig, ax = plt.subplots(figsize=(6, 4)) 116 | im = ax.imshow(alignment.transpose(), aspect='auto', origin='lower', 117 | interpolation='none') 118 | fig.colorbar(im, ax=ax) 119 | xlabel = 'Decoder timestep' 120 | if info is not None: 121 | xlabel += '\n\n' + info 122 | plt.xlabel(xlabel) 123 | plt.ylabel('Encoder timestep') 124 | plt.tight_layout() 125 | 126 | fig.canvas.draw() 127 | data = np.fromstring(fig.canvas.tostring_rgb(), dtype=np.uint8, sep='') 128 | data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,)) 129 | plt.close() 130 | return data 131 | 132 | 133 | def load_wav_to_torch(full_path): 134 | sampling_rate, data = read(full_path) 135 | return torch.FloatTensor(data.astype(np.float32)), sampling_rate 136 | 137 | 138 | def load_filepaths_and_text(filename, split="|"): 139 | with open(filename, encoding='utf-8') as f: 140 | filepaths_and_text = [line.strip().split(split) for line in f] 141 | return filepaths_and_text 142 | 143 | 144 | def get_hparams(init=True): 145 | parser = argparse.ArgumentParser() 146 | parser.add_argument('-c', '--config', type=str, default="./configs/base.json", 147 | help='JSON file for configuration') 148 | parser.add_argument('-m', '--model', type=str, required=True, 149 | help='Model name') 150 | 151 | args = parser.parse_args() 152 | model_dir = os.path.join("./logs", args.model) 153 | 154 | if not os.path.exists(model_dir): 155 | os.makedirs(model_dir) 156 | 157 | config_path = args.config 158 | config_save_path = os.path.join(model_dir, "config.json") 159 | if init: 160 | with open(config_path, "r") as f: 161 | data = f.read() 162 | with open(config_save_path, "w") as f: 163 | f.write(data) 164 | else: 165 | with open(config_save_path, "r") as 
144 | def get_hparams(init=True):
145 |     parser = argparse.ArgumentParser()
146 |     parser.add_argument('-c', '--config', type=str, default="./configs/base.json",
147 |                         help='JSON file for configuration')
148 |     parser.add_argument('-m', '--model', type=str, required=True,
149 |                         help='Model name')
150 | 
151 |     args = parser.parse_args()
152 |     model_dir = os.path.join("./logs", args.model)
153 | 
154 |     if not os.path.exists(model_dir):
155 |         os.makedirs(model_dir)
156 | 
157 |     config_path = args.config
158 |     config_save_path = os.path.join(model_dir, "config.json")
159 |     if init:
160 |         with open(config_path, "r") as f:
161 |             data = f.read()
162 |         with open(config_save_path, "w") as f:
163 |             f.write(data)
164 |     else:
165 |         with open(config_save_path, "r") as f:
166 |             data = f.read()
167 |     config = json.loads(data)
168 | 
169 |     hparams = HParams(**config)
170 |     hparams.model_dir = model_dir
171 |     return hparams
172 | 
173 | 
174 | def get_hparams_from_dir(model_dir):
175 |     config_save_path = os.path.join(model_dir, "config.json")
176 |     with open(config_save_path, "r") as f:
177 |         data = f.read()
178 |     config = json.loads(data)
179 | 
180 |     hparams = HParams(**config)
181 |     hparams.model_dir = model_dir
182 |     return hparams
183 | 
184 | 
185 | def get_hparams_from_file(config_path):
186 |     with open(config_path, "r") as f:
187 |         data = f.read()
188 |     config = json.loads(data)
189 | 
190 |     hparams = HParams(**config)
191 |     return hparams
192 | 
193 | 
194 | def check_git_hash(model_dir):
195 |     source_dir = os.path.dirname(os.path.realpath(__file__))
196 |     if not os.path.exists(os.path.join(source_dir, ".git")):
197 |         logger.warning("{} is not a git repository, therefore hash value comparison will be ignored.".format(
198 |             source_dir
199 |         ))
200 |         return
201 | 
202 |     cur_hash = subprocess.getoutput("git rev-parse HEAD")
203 | 
204 |     path = os.path.join(model_dir, "githash")
205 |     if os.path.exists(path):
206 |         saved_hash = open(path).read()
207 |         if saved_hash != cur_hash:
208 |             logger.warning("git hash values are different. {}(saved) != {}(current)".format(
209 |                 saved_hash[:8], cur_hash[:8]))
210 |     else:
211 |         open(path, "w").write(cur_hash)
212 | 
213 | 
214 | def get_logger(model_dir, filename="train.log"):
215 |     global logger
216 |     logger = logging.getLogger(os.path.basename(model_dir))
217 |     logger.setLevel(logging.DEBUG)
218 | 
219 |     formatter = logging.Formatter("%(asctime)s\t%(name)s\t%(levelname)s\t%(message)s")
220 |     if not os.path.exists(model_dir):
221 |         os.makedirs(model_dir)
222 |     h = logging.FileHandler(os.path.join(model_dir, filename))
223 |     h.setLevel(logging.DEBUG)
224 |     h.setFormatter(formatter)
225 |     logger.addHandler(h)
226 |     return logger
227 | 
228 | 
229 | class HParams:
230 |     def __init__(self, **kwargs):
231 |         for k, v in kwargs.items():
232 |             if isinstance(v, dict):
233 |                 v = HParams(**v)
234 |             self[k] = v
235 | 
236 |     def keys(self):
237 |         return self.__dict__.keys()
238 | 
239 |     def items(self):
240 |         return self.__dict__.items()
241 | 
242 |     def values(self):
243 |         return self.__dict__.values()
244 | 
245 |     def __len__(self):
246 |         return len(self.__dict__)
247 | 
248 |     def __getitem__(self, key):
249 |         return getattr(self, key)
250 | 
251 |     def __setitem__(self, key, value):
252 |         return setattr(self, key, value)
253 | 
254 |     def __contains__(self, key):
255 |         return key in self.__dict__
256 | 
257 |     def __repr__(self):
258 |         return self.__dict__.__repr__()
259 | 
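# A minimal, self-contained sketch of HParams behavior (the example values are
# illustrative, not taken from configs/singing_base.json): nested dicts are
# wrapped recursively, and item access falls through to getattr/setattr.
if __name__ == "__main__":
    _demo = HParams(train={"batch_size": 16}, data={"sampling_rate": 16000})
    assert _demo.train.batch_size == _demo["train"]["batch_size"] == 16
    assert "data" in _demo and len(_demo) == 2
    print(_demo)  # {'train': {'batch_size': 16}, 'data': {'sampling_rate': 16000}}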
--------------------------------------------------------------------------------
/vsinging_debug.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 | import numpy as np
4 | 
5 | from scipy.io import wavfile
6 | from time import time
7 | 
8 | import torch
9 | import utils
10 | from models import SynthesizerTrn
11 | 
12 | 
13 | def save_wav(wav, path, rate):
14 |     wav *= 32767 / max(0.01, np.max(np.abs(wav))) * 0.6  # peak-normalize to ~60% full scale
15 |     wavfile.write(path, rate, wav.astype(np.int16))
16 | 
17 | 
18 | # define model and load checkpoint
19 | hps = utils.get_hparams_from_file("./configs/singing_base.json")
20 | 
21 | net_g = SynthesizerTrn(
22 |     hps.data.filter_length // 2 + 1,
23 |     hps.train.segment_size // hps.data.hop_length,
24 |     **hps.model,
25 | ).cuda()
26 | 
27 | _ = utils.load_checkpoint("./logs/singing_base/G_160000.pth", net_g, None)
28 | net_g.eval()
29 | # net_g.remove_weight_norm()
30 | 
31 | # check directory existence
32 | if not os.path.exists("./singing_out"):
33 |     os.makedirs("./singing_out")
34 | 
35 | idxs = [
36 |     "2001000001",
37 |     "2001000002",
38 |     "2001000003",
39 |     "2001000004",
40 |     "2001000005",
41 |     "2001000006",
42 |     "2051001912",
43 |     "2051001913",
44 |     "2051001914",
45 |     "2051001915",
46 |     "2051001916",
47 |     "2051001917",
48 | ]
49 | for idx in idxs:
50 |     phone = np.load(f"../VISinger_data/label_vits/{idx}_label.npy")
51 |     score = np.load(f"../VISinger_data/label_vits/{idx}_score.npy")
52 |     pitch = np.load(f"../VISinger_data/label_vits/{idx}_pitch.npy")
53 |     slurs = np.load(f"../VISinger_data/label_vits/{idx}_slurs.npy")
54 |     phone = torch.LongTensor(phone)
55 |     score = torch.LongTensor(score)
56 |     pitch = torch.LongTensor(pitch)
57 |     slurs = torch.LongTensor(slurs)
58 | 
59 |     phone_lengths = phone.size()[0]
60 | 
61 |     begin_time = time()
62 |     with torch.no_grad():
63 |         phone = phone.cuda().unsqueeze(0)
64 |         score = score.cuda().unsqueeze(0)
65 |         pitch = pitch.cuda().unsqueeze(0)
66 |         slurs = slurs.cuda().unsqueeze(0)
67 |         phone_lengths = torch.LongTensor([phone_lengths]).cuda()
68 |         audio = (
69 |             net_g.infer(phone, phone_lengths, score, pitch, slurs)[0][0, 0]
70 |             .data.cpu()
71 |             .float()
72 |             .numpy()
73 |         )
74 |     end_time = time()
75 |     run_time = end_time - begin_time
76 |     print("Synth Time (Seconds):", run_time)
77 |     data_len = len(audio) / hps.data.sampling_rate
78 |     print("Wave Time (Seconds):", data_len)
79 |     print("Real-time Factor:", run_time / data_len)
80 |     save_wav(audio, f"./singing_out/singing_{idx}.wav", hps.data.sampling_rate)
81 | 
82 | # can be deleted
83 | os.system("chmod 777 ./singing_out -R")
84 | 
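# A hedged sketch of the per-utterance features the script above loads. The
# exact contents of ../VISinger_data/label_vits/ are an assumption (presumably
# produced by the scripts under prepare/), but each array is a 1-D integer
# sequence of one shared length T, so the four tensors stay frame-aligned:
#
#     {idx}_label.npy  ->  phoneme ids
#     {idx}_score.npy  ->  MIDI note ids
#     {idx}_pitch.npy  ->  coarse F0 bin ids
#     {idx}_slurs.npy  ->  slur flags (0/1)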
--------------------------------------------------------------------------------
/vsinging_infer.py:
--------------------------------------------------------------------------------
1 | import os
2 | import numpy as np
3 | import matplotlib.pyplot as plt
4 | 
5 | from scipy.io import wavfile
6 | from time import time
7 | 
8 | import torch
9 | import utils
10 | from models import Synthesizer
11 | from prepare.data_vits import SingInput
12 | from prepare.data_vits import FeatureInput
13 | from prepare.phone_map import get_vocab_size
14 | 
15 | 
16 | def save_wav(wav, path, rate):
17 |     wav *= 32767 / max(0.01, np.max(np.abs(wav))) * 0.6  # peak-normalize to ~60% full scale
18 |     wavfile.write(path, rate, wav.astype(np.int16))
19 | 
20 | 
21 | use_cuda = True
22 | 
23 | # define model and load checkpoint
24 | hps = utils.get_hparams_from_file("./configs/singing_base.json")
25 | 
26 | vocab_size = get_vocab_size()
27 | 
28 | model_path = "./logs/singing_base/"
29 | saved_models = os.listdir(model_path)
30 | iter_nums = []
31 | for i in range(len(saved_models)):
32 |     if os.path.splitext(saved_models[i])[1] == ".pth" and "G" in saved_models[i]:
33 |         iter_nums.append(int(os.path.splitext(saved_models[i])[0][2:]))
34 | iter_nums = sorted(iter_nums, reverse=True)
35 | 
36 | print("start inferring (G_" + str(iter_nums[0]) + ".pth)")
37 | 
38 | net_g = Synthesizer(
39 |     vocab_size,
40 |     hps.data.filter_length // 2 + 1,
41 |     hps.train.segment_size // hps.data.hop_length,
42 |     **hps.model,
43 | )  # .cuda()
44 | 
45 | if use_cuda:
46 |     net_g = net_g.cuda()
47 | 
48 | _ = utils.load_checkpoint(
49 |     "./logs/singing_base/G_" + str(iter_nums[0]) + ".pth", net_g, None
50 | )
51 | 
52 | net_g.eval()
53 | # net_g.remove_weight_norm()
54 | 
55 | singInput = SingInput(hps.data.sampling_rate, hps.data.hop_length)
56 | featureInput = FeatureInput(
57 |     "../VISinger_data/wav_dump_16k/", hps.data.sampling_rate, hps.data.hop_length
58 | )
59 | 
60 | # check directory existence
61 | if not os.path.exists("./singing_out"):
62 |     os.makedirs("./singing_out")
63 | 
64 | fo = open("./vsinging_infer.txt", "r+")
65 | while True:
66 |     try:
67 |         message = fo.readline().strip()
68 |     except Exception as e:
69 |         print("stopped reading input, exception:", e)
70 |         break
71 |     if message is None:
72 |         break
73 |     if message == "":
74 |         break
75 |     print(message)
76 |     (
77 |         file,
78 |         labels_ids,
79 |         labels_frames,
80 |         scores_ids,
81 |         scores_dur,
82 |         labels_slr,
83 |         labels_uvs,
84 |     ) = singInput.parseInput(message)
85 | 
86 |     phone = torch.LongTensor(labels_ids)
87 |     score = torch.LongTensor(scores_ids)
88 |     score_dur = torch.LongTensor(scores_dur)
89 |     slurs = torch.LongTensor(labels_slr)
90 | 
91 |     phone_lengths = phone.size()[0]
92 | 
93 |     begin_time = time()
94 |     with torch.no_grad():
95 |         if use_cuda:
96 |             phone = phone.cuda().unsqueeze(0)
97 |             score = score.cuda().unsqueeze(0)
98 |             score_dur = score_dur.cuda().unsqueeze(0)
99 |             slurs = slurs.cuda().unsqueeze(0)
100 |             phone_lengths = torch.LongTensor([phone_lengths]).cuda()
101 |         else:
102 |             phone = phone.unsqueeze(0)
103 |             score = score.unsqueeze(0)
104 |             score_dur = score_dur.unsqueeze(0)
105 |             slurs = slurs.unsqueeze(0)
106 |             phone_lengths = torch.LongTensor([phone_lengths])
107 |         audio = (
108 |             net_g.infer(phone, phone_lengths, score, score_dur, slurs)[0][0, 0]
109 |             .data.cpu()
110 |             .float()
111 |             .numpy()
112 |         )
113 |     end_time = time()
114 |     run_time = end_time - begin_time
115 |     print("Synth Time (Seconds):", run_time)
116 |     data_len = len(audio) / hps.data.sampling_rate
117 |     print("Wave Time (Seconds):", data_len)
118 |     print("Real-time Factor:", run_time / data_len)
119 |     save_wav(audio, f"./singing_out/{file}.wav", hps.data.sampling_rate)
120 | fo.close()
121 | # can be deleted
122 | os.system("chmod 777 ./singing_out -R")
123 | 
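# The directory scan above re-implements checkpoint discovery by hand; a
# shorter equivalent (a sketch, assuming generator checkpoints are named
# G_<step>.pth as elsewhere in this repo) would reuse utils.py:
#
#     ckpt_path = utils.latest_checkpoint_path("./logs/singing_base", "G_*.pth")
#     _ = utils.load_checkpoint(ckpt_path, net_g, None)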
--------------------------------------------------------------------------------
/vsinging_infer_jp.py:
--------------------------------------------------------------------------------
1 | import os
2 | import numpy as np
3 | import matplotlib.pyplot as plt
4 | 
5 | from scipy.io import wavfile
6 | from time import time
7 | 
8 | import torch
9 | import utils
10 | from models import SynthesizerTrn
11 | from prepare.data_vits_phn_ofuton import SingInput
12 | from prepare.data_vits_phn_ofuton import FeatureInput
13 | from prepare.phone_map import get_vocab_size
14 | 
15 | 
16 | def save_wav(wav, path, rate):
17 |     wav *= 32767 / max(0.01, np.max(np.abs(wav))) * 0.6  # peak-normalize to ~60% full scale
18 |     wavfile.write(path, rate, wav.astype(np.int16))
19 | 
20 | 
21 | use_cuda = True
22 | 
23 | # define model and load checkpoint
24 | hps = utils.get_hparams_from_file("./configs/singing_base.json")
25 | 
26 | vocab_size = get_vocab_size()
27 | 
28 | net_g = SynthesizerTrn(
29 |     vocab_size,
30 |     hps.data.filter_length // 2 + 1,
31 |     hps.train.segment_size // hps.data.hop_length,
32 |     **hps.model,
33 | )  # .cuda()
34 | 
35 | if use_cuda:
36 |     net_g = net_g.cuda()
37 | 
38 | _ = utils.load_checkpoint("./logs/singing_base/G_40000.pth", net_g, None)
39 | net_g.eval()
40 | # net_g.remove_weight_norm()
41 | 
42 | singInput = SingInput(hps.data.sampling_rate, hps.data.hop_length)
43 | featureInput = FeatureInput(
44 |     "../VISinger_data/wav_dump_16k/", hps.data.sampling_rate, hps.data.hop_length
45 | )
46 | 
47 | # check directory existence
48 | if not os.path.exists("./singing_out"):
49 |     os.makedirs("./singing_out")
50 | 
51 | fo = open("./vsinging_infer_jp.txt", "r+")
52 | while True:
53 |     try:
54 |         message = fo.readline().strip()
55 |     except Exception as e:
56 |         print("stopped reading input, exception:", e)
57 |         break
58 |     if message is None:
59 |         break
60 |     if message == "":
61 |         break
62 |     print(message)
63 |     (
64 |         file,
65 |         labels_ids,
66 |         labels_frames,
67 |         scores_ids,
68 |         scores_dur,
69 |         labels_slr,
70 |         # labels_uvs,
71 |     ) = singInput.parseInput(message)
72 | 
73 |     phone = torch.LongTensor(labels_ids)
74 |     score = torch.LongTensor(scores_ids)
75 |     score_dur = torch.LongTensor(scores_dur)
76 |     slurs = torch.LongTensor(labels_slr)
77 | 
78 |     phone_lengths = phone.size()[0]
79 | 
80 |     begin_time = time()
81 |     with torch.no_grad():
82 |         if use_cuda:
83 |             phone = phone.cuda().unsqueeze(0)
84 |             score = score.cuda().unsqueeze(0)
85 |             score_dur = score_dur.cuda().unsqueeze(0)
86 |             slurs = slurs.cuda().unsqueeze(0)
87 |             phone_lengths = torch.LongTensor([phone_lengths]).cuda()
88 |         else:
89 |             phone = phone.unsqueeze(0)
90 |             score = score.unsqueeze(0)
91 |             score_dur = score_dur.unsqueeze(0)
92 |             slurs = slurs.unsqueeze(0)
93 |             phone_lengths = torch.LongTensor([phone_lengths])
94 |         audio = (
95 |             net_g.infer(phone, phone_lengths, score, score_dur, slurs)[0][0, 0]
96 |             .data.cpu()
97 |             .float()
98 |             .numpy()
99 |         )
100 |     end_time = time()
101 |     run_time = end_time - begin_time
102 |     print("Synth Time (Seconds):", run_time)
103 |     data_len = len(audio) / hps.data.sampling_rate
104 |     print("Wave Time (Seconds):", data_len)
105 |     print("Real-time Factor:", run_time / data_len)
106 |     save_wav(audio, f"./singing_out/{file}.wav", hps.data.sampling_rate)
107 | fo.close()
108 | # can be deleted
109 | os.system("chmod 777 ./singing_out -R")
110 | 
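# A small worked example for the timing printout shared by these inference
# scripts: run_time / data_len is a real-time factor (seconds of compute per
# second of audio), not a percentage. Synthesizing 4.0 s of audio in 0.5 s
# gives 0.5 / 4.0 = 0.125, i.e. about 8x faster than real time.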
--------------------------------------------------------------------------------
/vsinging_song.py:
--------------------------------------------------------------------------------
1 | import os
2 | import numpy as np
3 | 
4 | from scipy.io import wavfile
5 | from time import time
6 | 
7 | import torch
8 | import utils
9 | from models import Synthesizer
10 | from prepare.data_vits import SingInput
11 | from prepare.data_vits import FeatureInput
12 | 
13 | 
14 | def save_wav(wav, path, rate):
15 |     wav *= 32767 / max(0.01, np.max(np.abs(wav))) * 0.6  # peak-normalize to ~60% full scale
16 |     wavfile.write(path, rate, wav.astype(np.int16))
17 | 
18 | 
19 | # define model and load checkpoint
20 | hps = utils.get_hparams_from_file("./configs/singing_base.json")
21 | 
22 | net_g = Synthesizer(
23 |     hps.data.filter_length // 2 + 1,
24 |     hps.train.segment_size // hps.data.hop_length,
25 |     **hps.model,
26 | )
27 | 
28 | # _ = utils.load_checkpoint("./logs/singing_base/G_160000.pth", net_g, None)
29 | # net_g.remove_weight_norm()
30 | # torch.save(net_g, "visinger.pth")
31 | net_g = torch.load("visinger.pth", map_location="cpu")
32 | net_g.eval().cuda()
33 | # net_g.remove_weight_norm()
34 | 
35 | singInput = SingInput(16000, 256)
36 | featureInput = FeatureInput("../VISinger_data/wav_dump_16k/", 16000, 256)
37 | 
38 | # check directory existence
39 | if not os.path.exists("./singing_out"):
40 |     os.makedirs("./singing_out")
41 | 
42 | fo = open("./vsinging_song_midi.txt", "r+")
43 | song_rate = 16000
44 | song_time = fo.readline().strip().split("|")[1]
45 | song_length = int(song_rate * (float(song_time) + 30))
46 | song_data = np.zeros(song_length, dtype="float32")
47 | while True:
48 |     try:
49 |         message = fo.readline().strip()
50 |     except Exception as e:
51 |         print("stopped reading input, exception:", e)
52 |         break
53 |     if message is None:
54 |         break
55 |     if message == "":
56 |         break
57 |     (
58 |         item_indx,
59 |         item_time,
60 |         labels_ids,
61 |         labels_frames,
62 |         scores_ids,
63 |         scores_dur,
64 |         labels_slr,
65 |         labels_uvs,
66 |     ) = singInput.parseSong(message)
67 |     labels_ids = singInput.expandInput(labels_ids, labels_frames)
68 |     labels_uvs = singInput.expandInput(labels_uvs, labels_frames)
69 |     labels_slr = singInput.expandInput(labels_slr, labels_frames)
70 |     scores_ids = singInput.expandInput(scores_ids, labels_frames)
71 |     scores_pit = singInput.scorePitch(scores_ids)
72 |     # element by element
73 |     scores_pit = scores_pit * labels_uvs
74 |     # scores_pit = singInput.smoothPitch(scores_pit)
75 |     # scores_pit = scores_pit * labels_uvs
76 |     phone = torch.LongTensor(labels_ids)
77 |     score = torch.LongTensor(scores_ids)
78 |     slurs = torch.LongTensor(labels_slr)
79 |     pitch = featureInput.coarse_f0(scores_pit)
80 |     pitch = torch.LongTensor(pitch)
81 | 
82 |     phone_lengths = phone.size()[0]
83 | 
84 |     begin_time = time()
85 |     with torch.no_grad():
86 |         phone = phone.cuda().unsqueeze(0)
87 |         score = score.cuda().unsqueeze(0)
88 |         pitch = pitch.cuda().unsqueeze(0)
89 |         slurs = slurs.cuda().unsqueeze(0)
90 |         phone_lengths = torch.LongTensor([phone_lengths]).cuda()
91 |         audio = (
92 |             net_g.infer(phone, phone_lengths, score, pitch, slurs)[0][0, 0]
93 |             .data.cpu()
94 |             .float()
95 |             .numpy()
96 |         )
97 |     end_time = time()
98 |     run_time = end_time - begin_time
99 |     print("Synth Time (Seconds):", run_time)
100 |     data_len = len(audio) / song_rate
101 |     print("Wave Time (Seconds):", data_len)
102 |     print("Real-time Factor:", run_time / data_len)
103 |     save_wav(audio, f"./singing_out/{item_indx}.wav", hps.data.sampling_rate)
104 |     # place this clip at its start offset in the full song buffer
105 |     item_start = int(song_rate * float(item_time))
106 |     item_end = item_start + len(audio)
107 |     song_data[item_start:item_end] = audio
108 | # after the loop: write the stitched full song
109 | song_data = np.array(song_data, dtype="float32")
110 | save_wav(song_data, "./singing_out/_song.wav", hps.data.sampling_rate)
111 | fo.close()
112 | # can be deleted
113 | os.system("chmod 777 ./singing_out -R")
114 | 
--------------------------------------------------------------------------------
/vsinging_song_midi.txt:
--------------------------------------------------------------------------------
1 | song_time|116.88723672656248 2 | 0|0000.694| 化 外 山 间 岁 月 皆 看 老|h ua w ai sh an j ian s ui y ve j ie k an l ao|57 57 64 64 62 62 60 60 59 59 60 60 62 62 64 64 57 57|0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.506 0.506|0.064 0.249 0.088 0.249 0.088 0.273 0.064 0.249 0.088 0.249 0.088 0.273 0.064 0.273 0.064 0.241 0.096 0.506|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 | 1|0006.140| 洛 雪 无 声 天 地 掩 尘 嚣|l uo x ve w u sh eng t ian d i y an ch en x iao|57 57 64 64 62 62 60 60 59 59 60 60 62 62 64 64 69 69|0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.590 0.590|0.096 0.249 0.088 0.249 0.088 0.249 0.088 0.305 0.032 0.305 0.032 0.249 0.088 0.273 0.064 0.249 0.088 0.590|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 | 2|0010.923| 他 看 尽 晨 曦 日 暮 AP 饮 罢 腰 间 酒 一 壶 AP 依 稀 当 年 孤 旅 踏 苍 霞 尽 处|t a k an j in ch en x i r i m u AP y in b a y ao j ian j iu y i h u AP y i x i d ang n ian g u l v t a c ang x ia j in ch u|60 60 62 62 64 64 62 62 67 67 64 64 62 62 rest 64 64 67 67 72 72 71 71 69 69 67 67 69 69 rest 67 67 64 64 62 62 64 64 62 62 60 60 59 59 60 60 62 62 64 64 57 57|0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.421 0.421 0.253
0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 1.180 1.180|0.032 0.273 0.064 0.273 0.064 0.273 0.064 0.249 0.088 0.249 0.088 0.249 0.088 0.337 0.249 0.088 0.297 0.040 0.249 0.088 0.273 0.064 0.273 0.064 0.249 0.088 0.273 0.064 0.421 0.165 0.088 0.249 0.088 0.305 0.032 0.249 0.088 0.273 0.064 0.241 0.096 0.305 0.032 0.249 0.088 0.249 0.088 0.273 0.064 0.273 0.064 1.180|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 | 3|0021.678| 风 霜 冷 冽 他 眉 目 AP 时 光 雕 琢 他 风 骨 AP 浮 世 南 柯 一 梦 冷 暖 都 藏 住|f eng sh uang l eng l ie t a m ei m u AP sh i g uang d iao z uo t a f eng g u AP f u sh i n an k e y i m eng l eng n uan d ou c ang zh u|64 64 67 67 69 69 67 67 72 72 69 69 67 67 rest 64 64 62 62 64 64 62 62 67 67 64 64 60 60 rest 57 57 60 60 64 64 62 62 60 60 57 57 60 60 57 57 67 67 62 62 64 64|0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.674 0.674|0.064 0.249 0.088 0.241 0.096 0.241 0.096 0.305 0.032 0.249 0.088 0.249 0.088 0.337 0.249 0.088 0.273 0.064 0.305 0.032 0.249 0.088 0.305 0.032 0.273 0.064 0.273 0.064 0.337 0.273 0.064 0.249 0.088 0.249 0.088 0.273 0.064 0.249 0.088 0.249 0.088 0.241 0.096 0.249 0.088 0.305 0.032 0.249 0.088 0.273 0.064 0.674|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 | 4|0032.356| 哪 杯 酒 烫 过 肺 腑 AP 曾 换 他 睥 睨 一 顾 AP 剑 破 乾 坤 轮 转 山 河 倾 覆|n a b ei j iu t ang g uo f ei f u AP c eng h uan t a p i n i y i g u AP j ian p o q ian k un l un zh uan sh an h e q ing f u|64 64 67 67 69 69 67 67 72 72 69 69 67 67 rest 64 64 62 62 64 64 62 62 67 67 64 64 60 60 rest 57 57 64 64 62 62 64 64 67 67 62 62 60 60 59 59 60 60 57 57|0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.674 0.674 0.337 0.337 0.337 0.337 1.348 1.348|0.088 0.297 0.040 0.273 0.064 0.305 0.032 0.273 0.064 0.273 0.064 0.273 0.064 0.337 0.249 0.088 0.273 0.064 0.305 0.032 0.249 0.088 0.249 0.088 0.249 0.088 0.273 0.064 0.337 0.273 0.064 0.249 0.088 0.241 0.096 0.273 0.064 0.241 0.096 0.273 0.064 0.249 0.088 0.610 0.064 0.241 0.096 0.273 0.064 1.348|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 | 5|0043.620| 他 三 清 尘 外 剔 去 心 中 毒|t a s an q ing ch en w ai t i q v x in zh ong d u|57 57 60 60 64 64 62 62 60 60 59 59 60 60 62 62 64 64 57 57|0.169 0.169 0.169 0.169 0.674 0.674 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.590 0.590|0.032 0.081 0.088 0.073 0.096 0.610 0.064 0.249 0.088 0.305 0.032 0.241 0.096 0.249 0.088 0.273 0.064 0.305 0.032 0.590|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 | 6|0048.981| 尝 世 间 百 味 甘 醇 与 涩 苦|ch ang sh i j ian b ai w ei g an ch un y v s e k u|57 57 60 60 64 64 62 62 60 60 59 59 60 60 62 62 64 64 69 69|0.169 0.169 0.169 0.169 0.674 0.674 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 1.180 1.180|0.064 0.081 0.088 0.105 0.064 0.634 0.040 0.249 0.088 0.273 0.064 0.273 0.064 0.249 0.088 0.249 0.088 0.273 0.064 1.180|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 | 
7|0053.929| 曾 有 谁 偏 执 不 悟 AP 谈 笑 斗 酒 至 酣 处 AP 而 今 不 过 拍 去 肩 上 红 尘 土|c eng y ou sh ui p ian zh i b u w u AP t an x iao d ou j iu zh i h an ch u AP er j in b u g uo p ai q v j ian sh ang h ong ch en t u|60 60 62 62 64 64 67 67 64 64 67 67 62 62 rest 62 62 67 67 72 72 71 71 69 69 67 67 69 69 rest 67 64 64 62 62 62 62 64 64 67 67 60 60 60 60 59 59 60 60 57 57|0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.674 0.674|0.088 0.249 0.088 0.249 0.088 0.249 0.088 0.273 0.064 0.297 0.040 0.249 0.088 0.337 0.305 0.032 0.249 0.088 0.305 0.032 0.273 0.064 0.273 0.064 0.273 0.064 0.273 0.064 0.337 0.337 0.273 0.064 0.297 0.040 0.273 0.064 0.249 0.088 0.241 0.096 0.273 0.064 0.249 0.088 0.273 0.064 0.273 0.064 0.305 0.032 0.674|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 | 8|0064.655| 风 霜 冷 冽 他 眉 目 时 光 雕 琢 他 风 骨 浮 世 南 柯 一 梦 冷 暖 都 藏 住|f eng sh uang l eng l ie t a m ei m u sh i g uang d iao z uo t a f eng g u f u sh i n an k e y i m eng l eng n uan d ou c ang zh u|64 64 67 67 69 69 67 67 72 72 69 69 67 67 64 64 62 62 64 64 62 62 67 67 64 64 60 60 57 57 60 60 64 64 62 62 60 60 57 57 60 60 57 57 67 67 62 62 64 64|0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.674 0.674 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.674 0.674 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.506 0.506|0.064 0.249 0.088 0.241 0.096 0.241 0.096 0.305 0.032 0.249 0.088 0.249 0.088 0.586 0.088 0.273 0.064 0.305 0.032 0.249 0.088 0.305 0.032 0.273 0.064 0.273 0.064 0.610 0.064 0.249 0.088 0.249 0.088 0.273 0.064 0.249 0.088 0.249 0.088 0.241 0.096 0.249 0.088 0.305 0.032 0.249 0.088 0.273 0.064 0.506|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 11 | 9|0075.418| 哪 杯 酒 烫 过 肺 腑 曾 换 他 睥 睨 一 顾 AP 剑 破 乾 坤 轮 转 山 河 倾 覆|n a b ei j iu t ang g uo f ei f u c eng h uan t a p i n i y i g u AP j ian p o q ian k un l un zh uan sh an h e q ing f u|64 64 67 67 69 69 67 67 72 72 69 69 67 67 64 64 62 62 64 64 62 62 67 67 64 64 60 60 rest 57 57 64 64 62 62 64 64 67 67 62 62 60 60 59 59 60 60 57 57|0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.674 0.674 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.421 0.421 0.253 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.674 0.674 0.337 0.337 0.337 0.337 0.674 0.674|0.088 0.297 0.040 0.273 0.064 0.305 0.032 0.273 0.064 0.273 0.064 0.273 0.064 0.586 0.088 0.273 0.064 0.305 0.032 0.249 0.088 0.249 0.088 0.249 0.088 0.273 0.064 0.421 0.189 0.064 0.249 0.088 0.241 0.096 0.273 0.064 0.241 0.096 0.273 0.064 0.249 0.088 0.610 0.064 0.241 0.096 0.273 0.064 0.674|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 12 | 10|0086.260| 到 最 后 沧 海 一 粟 AP 何 必 江 湖 多 殊 途 AP 当 年 论 剑 峰 顶 谁 几 笔 成 书|d ao z ui h ou c ang h ai y i s u AP h e b i j iang h u d uo sh u t u AP d ang n ian l un j ian f eng d ing sh ui j i b i ch eng sh u|64 64 67 67 69 69 67 67 72 72 69 69 67 67 rest 64 64 62 62 64 64 62 62 67 67 64 64 60 60 rest 57 57 60 60 64 64 62 62 60 60 57 57 60 60 57 57 67 67 62 62 64 64|0.337 0.337 0.337 0.337 0.337 0.337 0.337 
0.337 0.337 0.337 0.337 0.337 0.421 0.421 0.253 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.421 0.421 0.253 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.674 0.674|0.032 0.249 0.088 0.273 0.064 0.249 0.088 0.273 0.064 0.249 0.088 0.249 0.088 0.421 0.189 0.064 0.297 0.040 0.273 0.064 0.273 0.064 0.305 0.032 0.249 0.088 0.305 0.032 0.421 0.221 0.032 0.249 0.088 0.241 0.096 0.273 0.064 0.273 0.064 0.305 0.032 0.249 0.088 0.273 0.064 0.297 0.040 0.273 0.064 0.249 0.088 0.674|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13 | 11|0096.991| 纵 他 朝 众 生 再 晤 AP 奈 何 明 月 终 辜 负 AP 坐 听 晨 钟 难 算 太 虚 有 无|z ong t a ch ao zh ong sh eng z ai w u AP n ai h e m ing y ve zh ong g u f u AP z uo t ing ch en zh ong n an s uan t ai x v y ou w u|64 64 67 67 69 69 67 67 72 72 69 69 67 67 rest 64 64 62 62 64 64 62 62 67 67 64 64 60 60 rest 57 57 64 64 62 62 64 64 62 62 60 60 59 59 60 60 59 59 57 57|0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.421 0.421 0.253 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.421 0.421 0.253 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.674 0.674 0.337 0.337 0.169 0.169 1.264 1.264|0.088 0.305 0.032 0.273 0.064 0.273 0.064 0.249 0.088 0.249 0.088 0.249 0.088 0.421 0.165 0.088 0.273 0.064 0.249 0.088 0.249 0.088 0.273 0.064 0.273 0.064 0.273 0.064 0.421 0.165 0.088 0.305 0.032 0.273 0.064 0.273 0.064 0.249 0.088 0.249 0.088 0.305 0.032 0.586 0.088 0.249 0.088 0.081 0.088 1.264|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14 | 12|0107.917| 天 道 勘 破 敢 问 一 句 悟 不|t ian d ao k an p o g an w en y i j v w u b u|57 57 64 64 62 62 64 64 62 62 60 60 59 59 60 60 62 62 64 64|0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.337 0.421 0.421 0.506 0.506 0.337 0.337 0.590 0.590|0.032 0.305 0.032 0.273 0.064 0.249 0.088 0.273 0.064 0.249 0.088 0.249 0.088 0.357 0.064 0.418 0.088 0.297 0.040 0.590|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 | 13|0112.496| 悟 悟|w u w u|68 68 69 69|0.506 0.506 3.792 3.792|0.088 0.418 0.088 3.792|0 0 0 0 --------------------------------------------------------------------------------