├── .gitmodules
├── LICENSE
├── README.md
├── audio_processing.py
├── data
│   ├── cmu_dictionary
│   ├── debussy_prelude_lyrics.musicxml
│   ├── example1.wav
│   ├── example2.wav
│   ├── examples_filelist.txt
│   ├── haendel_hallelujah.musicxml
│   └── mozart_requiem_kyrie_satb.musicxml
├── data_utils.py
├── distributed.py
├── filelists
│   ├── libritts_speakerinfo.txt
│   ├── libritts_train_clean_100_audiopath_text_sid_atleast5min_val_filelist.txt
│   ├── libritts_train_clean_100_audiopath_text_sid_shorterthan10s_atleast5min_train_filelist.txt
│   ├── ljs_audiopaths_text_sid_train_filelist.txt
│   └── ljs_audiopaths_text_sid_val_filelist.txt
├── fp16_optimizer.py
├── hparams.py
├── inference.ipynb
├── layers.py
├── logger.py
├── loss_function.py
├── loss_scaler.py
├── mellotron_logo.png
├── mellotron_utils.py
├── model.py
├── modules.py
├── multiproc.py
├── plotting_utils.py
├── requirements.txt
├── stft.py
├── text
│   ├── LICENSE
│   ├── __init__.py
│   ├── cleaners.py
│   ├── cmudict.py
│   ├── numbers.py
│   └── symbols.py
├── train.py
├── utils.py
└── yin.py

/.gitmodules:
--------------------------------------------------------------------------------
1 | [submodule "waveglow"]
2 | path = waveglow
3 | url = https://github.com/NVIDIA/waveglow.git
4 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | BSD 3-Clause License
2 |
3 | Copyright (c) 2019, NVIDIA Corporation
4 | All rights reserved.
5 |
6 | Redistribution and use in source and binary forms, with or without
7 | modification, are permitted provided that the following conditions are met:
8 |
9 | * Redistributions of source code must retain the above copyright notice, this
10 |   list of conditions and the following disclaimer.
11 |
12 | * Redistributions in binary form must reproduce the above copyright notice,
13 |   this list of conditions and the following disclaimer in the documentation
14 |   and/or other materials provided with the distribution.
15 |
16 | * Neither the name of the copyright holder nor the names of its
17 |   contributors may be used to endorse or promote products derived from
18 |   this software without specific prior written permission.
19 |
20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ![Mellotron](mellotron_logo.png "Mellotron")
2 |
3 | ### Rafael Valle\*, Jason Li\*, Ryan Prenger and Bryan Catanzaro
4 | In our recent [paper] we propose Mellotron: a multispeaker voice synthesis model
5 | based on Tacotron 2 GST that can make a voice emote and sing without emotive or
6 | singing training data.
7 |
8 | By explicitly conditioning on rhythm and continuous pitch
9 | contours from an audio signal or music score, Mellotron is able to generate
10 | speech in a variety of styles ranging from read speech to expressive speech,
11 | from slow drawls to rap and from monotonous voice to singing voice.
12 |
13 | Visit our [website] for audio samples.
14 |
15 | ## Pre-requisites
16 | 1. NVIDIA GPU + CUDA + cuDNN
17 |
18 | ## Setup
19 | 1. Clone this repo: `git clone https://github.com/NVIDIA/mellotron.git`
20 | 2. CD into this repo: `cd mellotron`
21 | 3. Initialize submodule: `git submodule init; git submodule update`
22 | 4. Install [PyTorch]
23 | 5. Install [Apex]
24 | 6. Install Python requirements or build the Docker image
25 |    - Install Python requirements: `pip install -r requirements.txt`
26 |
27 | ## Training
28 | 1. Update the filelists inside the filelists folder to point to your data (see the filelist format note at the end of this README)
29 | 2. `python train.py --output_directory=outdir --log_directory=logdir`
30 | 3. (OPTIONAL) `tensorboard --logdir=outdir/logdir`
31 |
32 | ## Training using a pre-trained model
33 | Training using a pre-trained model can lead to faster convergence.
34 | By default, the speaker embedding layer is [ignored].
35 |
36 | 1. Download our published Mellotron model trained on [LibriTTS] or [LJS]
37 | 2. `python train.py --output_directory=outdir --log_directory=logdir -c models/mellotron_libritts.pt --warm_start`
38 |
39 | ## Multi-GPU (distributed) and Automatic Mixed Precision Training
40 | 1. `python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True`
41 |
42 | ## Inference demo
43 | 1. `jupyter notebook --ip=127.0.0.1 --port=31337`
44 | 2. Load inference.ipynb
45 | 3. (optional) Download our published [WaveGlow](https://drive.google.com/open?id=1okuUstGoBe_qZ4qUEF8CcwEugHP7GM_b) model
46 |
47 | ## Related repos
48 | [WaveGlow](https://github.com/NVIDIA/WaveGlow): a faster-than-real-time flow-based
49 | generative network for speech synthesis.
50 |
51 | ## Acknowledgements
52 | This implementation uses code from the following repos: [Keith
53 | Ito](https://github.com/keithito/tacotron/), [Prem
54 | Seetharaman](https://github.com/pseeth/pytorch-stft),
55 | [Chengqi Deng](https://github.com/KinglittleQ/GST-Tacotron),
56 | [Patrice Guyot](https://github.com/patriceguyot/Yin), as described in our code.
57 |
58 | [ignored]: https://github.com/NVIDIA/mellotron/blob/master/hparams.py#L22
59 | [paper]: https://arxiv.org/abs/1910.11997
60 | [WaveGlow]: https://drive.google.com/open?id=1rpK8CzAAirq9sWZhe9nlfvxMF1dRgFbF
61 | [LibriTTS]: https://drive.google.com/open?id=1ZesPPyRRKloltRIuRnGZ2LIUEuMSVjkI
62 | [LJS]: https://drive.google.com/open?id=1UwDARlUl8JvB2xSuyMFHFsIWELVpgQD4
63 | [pytorch]: https://github.com/pytorch/pytorch#installation
64 | [website]: https://nv-adlr.github.io/Mellotron
65 | [Apex]: https://github.com/nvidia/apex
66 | [AMP]: https://github.com/NVIDIA/apex/tree/master/apex/amp
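The filelists referenced in the Training section above are plain-text files with one pipe-separated `audio_path|text|speaker_id` entry per line; `data/examples_filelist.txt` shows the expected format:

```
data/example1.wav|exploring the expanses of space to keep our planet safe|1
data/example2.wav|and all the species that call it home|1
```

A minimal sketch of how such a file can be parsed (the repo's own loader is `utils.load_filepaths_and_text`; the helper below is illustrative only, not part of the codebase):

```python
# Illustrative helper (assumption, not repo code): split each pipe-delimited
# filelist line into (audio_path, transcript, speaker_id), the per-entry layout
# that data_utils.TextMelLoader receives from utils.load_filepaths_and_text.
def read_filelist(path, delimiter="|"):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n").split(delimiter) for line in f if line.strip()]

entries = read_filelist("data/examples_filelist.txt")
audio_path, transcript, speaker_id = entries[0]
```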
--------------------------------------------------------------------------------
/audio_processing.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import numpy as np
3 | from scipy.signal import get_window
4 | import librosa.util as librosa_util
5 |
6 |
7 | def window_sumsquare(window, n_frames, hop_length=200, win_length=800,
8 |                      n_fft=800, dtype=np.float32, norm=None):
9 |     """
10 |     # from librosa 0.6
11 |     Compute the sum-square envelope of a window function at a given hop length.
12 | 13 | This is used to estimate modulation effects induced by windowing 14 | observations in short-time fourier transforms. 15 | 16 | Parameters 17 | ---------- 18 | window : string, tuple, number, callable, or list-like 19 | Window specification, as in `get_window` 20 | 21 | n_frames : int > 0 22 | The number of analysis frames 23 | 24 | hop_length : int > 0 25 | The number of samples to advance between frames 26 | 27 | win_length : [optional] 28 | The length of the window function. By default, this matches `n_fft`. 29 | 30 | n_fft : int > 0 31 | The length of each analysis frame. 32 | 33 | dtype : np.dtype 34 | The data type of the output 35 | 36 | Returns 37 | ------- 38 | wss : np.ndarray, shape=`(n_fft + hop_length * (n_frames - 1))` 39 | The sum-squared envelope of the window function 40 | """ 41 | if win_length is None: 42 | win_length = n_fft 43 | 44 | n = n_fft + hop_length * (n_frames - 1) 45 | x = np.zeros(n, dtype=dtype) 46 | 47 | # Compute the squared window at the desired length 48 | win_sq = get_window(window, win_length, fftbins=True) 49 | win_sq = librosa_util.normalize(win_sq, norm=norm)**2 50 | win_sq = librosa_util.pad_center(win_sq, n_fft) 51 | 52 | # Fill the envelope 53 | for i in range(n_frames): 54 | sample = i * hop_length 55 | x[sample:min(n, sample + n_fft)] += win_sq[:max(0, min(n_fft, n - sample))] 56 | return x 57 | 58 | 59 | def griffin_lim(magnitudes, stft_fn, n_iters=30): 60 | """ 61 | PARAMS 62 | ------ 63 | magnitudes: spectrogram magnitudes 64 | stft_fn: STFT class with transform (STFT) and inverse (ISTFT) methods 65 | """ 66 | 67 | angles = np.angle(np.exp(2j * np.pi * np.random.rand(*magnitudes.size()))) 68 | angles = angles.astype(np.float32) 69 | angles = torch.autograd.Variable(torch.from_numpy(angles)) 70 | signal = stft_fn.inverse(magnitudes, angles).squeeze(1) 71 | 72 | for i in range(n_iters): 73 | _, angles = stft_fn.transform(signal) 74 | signal = stft_fn.inverse(magnitudes, angles).squeeze(1) 75 | return signal 76 | 77 | 78 | def dynamic_range_compression(x, C=1, clip_val=1e-5): 79 | """ 80 | PARAMS 81 | ------ 82 | C: compression factor 83 | """ 84 | return torch.log(torch.clamp(x, min=clip_val) * C) 85 | 86 | 87 | def dynamic_range_decompression(x, C=1): 88 | """ 89 | PARAMS 90 | ------ 91 | C: compression factor used to compress 92 | """ 93 | return torch.exp(x) / C 94 | -------------------------------------------------------------------------------- /data/cmu_dictionary: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NVIDIA/mellotron/d5362ccae23984f323e3cb024a01ec1de0493aff/data/cmu_dictionary -------------------------------------------------------------------------------- /data/debussy_prelude_lyrics.musicxml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Prelude 5 | 6 | Debussy 7 | 8 | Finale v26 for Mac 9 | 2019-10-22 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 7.0273 20 | 40 21 | 22 | 23 | 1590 24 | 1229 25 | 26 | 57 27 | 57 28 | 57 29 | 114 30 | 31 | 32 | 57 33 | 57 34 | 57 35 | 114 36 | 37 | 38 | 39 | 40 | 0 41 | 0 42 | 43 | 103 44 | 60 45 | 46 | 47 | 0.918 48 | 5 49 | 0.918 50 | 0.918 51 | 5 52 | 1.0807 53 | 0.957 54 | 0.918 55 | 0.918 56 | 0.918 57 | 60 58 | 60 59 | 120 60 | 8 61 | 62 | 63 | 64 | 65 | 66 | 67 | Prelude 68 | 69 | 70 | Debussy 71 | 72 | 73 | 74 | Flute 75 | Pno. 
[flattened MusicXML markup from data/debussy_prelude_lyrics.musicxml omitted: several hundred lines of pitch, duration, beam, and lyric-syllable data (the sung syllables spell "Prelude to the Afternoon of a Faun") whose XML tags were stripped during extraction]
| 529 | G 530 | 1 531 | 5 532 | 533 | 6 534 | 1 535 | eighth 536 | down 537 | end 538 | 539 | single 540 | The 541 | 542 | 543 | 544 | 545 | E 546 | 5 547 | 548 | 12 549 | 1 550 | quarter 551 | down 552 | 553 | begin 554 | Af 555 | 556 | 557 | 558 | 559 | G 560 | 1 561 | 4 562 | 563 | 6 564 | 1 565 | eighth 566 | up 567 | 568 | middle 569 | ter 570 | 571 | 572 | 573 | 574 | B 575 | 4 576 | 577 | 18 578 | 579 | 1 580 | quarter 581 | 582 | down 583 | 584 | 585 | 586 | 587 | end 588 | noon 589 | 590 | 591 | 592 | 593 | 594 | 595 | 596 | 597 | 598 | 599 | 600 | 601 | 602 | 603 | 604 | 605 | 606 | 607 | B 608 | 4 609 | 610 | 6 611 | 612 | 1 613 | eighth 614 | down 615 | begin 616 | 617 | 618 | 619 | 620 | 621 | 622 | 623 | 624 | 625 | B 626 | 4 627 | 628 | 6 629 | 1 630 | eighth 631 | down 632 | continue 633 | 634 | single 635 | Of 636 | 637 | 638 | 639 | 640 | C 641 | 1 642 | 5 643 | 644 | 6 645 | 1 646 | eighth 647 | down 648 | end 649 | 650 | single 651 | A 652 | 653 | 654 | 655 | 656 | 657 | 658 | 659 | 660 | 661 | A 662 | 1 663 | 4 664 | 665 | 36 666 | 1 667 | half 668 | 669 | sharp 670 | up 671 | 672 | 673 | 674 | 675 | single 676 | Faun 677 | 678 | 679 | 680 | light-heavy 681 | 682 | 683 | 684 | 685 | 686 | -------------------------------------------------------------------------------- /data/example1.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NVIDIA/mellotron/d5362ccae23984f323e3cb024a01ec1de0493aff/data/example1.wav -------------------------------------------------------------------------------- /data/example2.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NVIDIA/mellotron/d5362ccae23984f323e3cb024a01ec1de0493aff/data/example2.wav -------------------------------------------------------------------------------- /data/examples_filelist.txt: -------------------------------------------------------------------------------- 1 | data/example1.wav|exploring the expanses of space to keep our planet safe|1 2 | data/example2.wav|and all the species that call it home|1 3 | -------------------------------------------------------------------------------- /data_utils.py: -------------------------------------------------------------------------------- 1 | import random 2 | import os 3 | import re 4 | import numpy as np 5 | import torch 6 | import torch.utils.data 7 | import librosa 8 | 9 | import layers 10 | from utils import load_wav_to_torch, load_filepaths_and_text 11 | from text import text_to_sequence, cmudict 12 | from yin import compute_yin 13 | 14 | 15 | class TextMelLoader(torch.utils.data.Dataset): 16 | """ 17 | 1) loads audio, text and speaker ids 18 | 2) normalizes text and converts them to sequences of one-hot vectors 19 | 3) computes mel-spectrograms and f0s from audio files. 
20 | """ 21 | def __init__(self, audiopaths_and_text, hparams, speaker_ids=None): 22 | self.audiopaths_and_text = load_filepaths_and_text(audiopaths_and_text) 23 | self.text_cleaners = hparams.text_cleaners 24 | self.max_wav_value = hparams.max_wav_value 25 | self.sampling_rate = hparams.sampling_rate 26 | self.stft = layers.TacotronSTFT( 27 | hparams.filter_length, hparams.hop_length, hparams.win_length, 28 | hparams.n_mel_channels, hparams.sampling_rate, hparams.mel_fmin, 29 | hparams.mel_fmax) 30 | self.sampling_rate = hparams.sampling_rate 31 | self.filter_length = hparams.filter_length 32 | self.hop_length = hparams.hop_length 33 | self.f0_min = hparams.f0_min 34 | self.f0_max = hparams.f0_max 35 | self.harm_thresh = hparams.harm_thresh 36 | self.p_arpabet = hparams.p_arpabet 37 | 38 | self.cmudict = None 39 | if hparams.cmudict_path is not None: 40 | self.cmudict = cmudict.CMUDict(hparams.cmudict_path) 41 | 42 | self.speaker_ids = speaker_ids 43 | if speaker_ids is None: 44 | self.speaker_ids = self.create_speaker_lookup_table( 45 | self.audiopaths_and_text) 46 | 47 | random.seed(1234) 48 | random.shuffle(self.audiopaths_and_text) 49 | 50 | def create_speaker_lookup_table(self, audiopaths_and_text): 51 | speaker_ids = np.sort(np.unique([x[2] for x in audiopaths_and_text])) 52 | d = {int(speaker_ids[i]): i for i in range(len(speaker_ids))} 53 | return d 54 | 55 | def get_f0(self, audio, sampling_rate=22050, frame_length=1024, 56 | hop_length=256, f0_min=100, f0_max=300, harm_thresh=0.1): 57 | f0, harmonic_rates, argmins, times = compute_yin( 58 | audio, sampling_rate, frame_length, hop_length, f0_min, f0_max, 59 | harm_thresh) 60 | pad = int((frame_length / hop_length) / 2) 61 | f0 = [0.0] * pad + f0 + [0.0] * pad 62 | 63 | f0 = np.array(f0, dtype=np.float32) 64 | return f0 65 | 66 | def get_data(self, audiopath_and_text): 67 | audiopath, text, speaker = audiopath_and_text 68 | text = self.get_text(text) 69 | mel, f0 = self.get_mel_and_f0(audiopath) 70 | speaker_id = self.get_speaker_id(speaker) 71 | return (text, mel, speaker_id, f0) 72 | 73 | def get_speaker_id(self, speaker_id): 74 | return torch.IntTensor([self.speaker_ids[int(speaker_id)]]) 75 | 76 | def get_mel_and_f0(self, filepath): 77 | audio, sampling_rate = load_wav_to_torch(filepath) 78 | if sampling_rate != self.stft.sampling_rate: 79 | raise ValueError("{} SR doesn't match target {} SR".format( 80 | sampling_rate, self.stft.sampling_rate)) 81 | audio_norm = audio / self.max_wav_value 82 | audio_norm = audio_norm.unsqueeze(0) 83 | melspec = self.stft.mel_spectrogram(audio_norm) 84 | melspec = torch.squeeze(melspec, 0) 85 | 86 | f0 = self.get_f0(audio.cpu().numpy(), self.sampling_rate, 87 | self.filter_length, self.hop_length, self.f0_min, 88 | self.f0_max, self.harm_thresh) 89 | f0 = torch.from_numpy(f0)[None] 90 | f0 = f0[:, :melspec.size(1)] 91 | 92 | return melspec, f0 93 | 94 | def get_text(self, text): 95 | text_norm = torch.IntTensor( 96 | text_to_sequence(text, self.text_cleaners, self.cmudict, self.p_arpabet)) 97 | 98 | return text_norm 99 | 100 | def __getitem__(self, index): 101 | return self.get_data(self.audiopaths_and_text[index]) 102 | 103 | def __len__(self): 104 | return len(self.audiopaths_and_text) 105 | 106 | 107 | class TextMelCollate(): 108 | """ Zero-pads model inputs and targets based on number of frames per setep 109 | """ 110 | def __init__(self, n_frames_per_step): 111 | self.n_frames_per_step = n_frames_per_step 112 | 113 | def __call__(self, batch): 114 | """Collate's training batch from 
normalized text and mel-spectrogram 115 | PARAMS 116 | ------ 117 | batch: [text_normalized, mel_normalized] 118 | """ 119 | # Right zero-pad all one-hot text sequences to max input length 120 | input_lengths, ids_sorted_decreasing = torch.sort( 121 | torch.LongTensor([len(x[0]) for x in batch]), 122 | dim=0, descending=True) 123 | max_input_len = input_lengths[0] 124 | 125 | text_padded = torch.LongTensor(len(batch), max_input_len) 126 | text_padded.zero_() 127 | for i in range(len(ids_sorted_decreasing)): 128 | text = batch[ids_sorted_decreasing[i]][0] 129 | text_padded[i, :text.size(0)] = text 130 | 131 | # Right zero-pad mel-spec 132 | num_mels = batch[0][1].size(0) 133 | max_target_len = max([x[1].size(1) for x in batch]) 134 | if max_target_len % self.n_frames_per_step != 0: 135 | max_target_len += self.n_frames_per_step - max_target_len % self.n_frames_per_step 136 | assert max_target_len % self.n_frames_per_step == 0 137 | 138 | # include mel padded, gate padded and speaker ids 139 | mel_padded = torch.FloatTensor(len(batch), num_mels, max_target_len) 140 | mel_padded.zero_() 141 | gate_padded = torch.FloatTensor(len(batch), max_target_len) 142 | gate_padded.zero_() 143 | output_lengths = torch.LongTensor(len(batch)) 144 | speaker_ids = torch.LongTensor(len(batch)) 145 | f0_padded = torch.FloatTensor(len(batch), 1, max_target_len) 146 | f0_padded.zero_() 147 | 148 | for i in range(len(ids_sorted_decreasing)): 149 | mel = batch[ids_sorted_decreasing[i]][1] 150 | mel_padded[i, :, :mel.size(1)] = mel 151 | gate_padded[i, mel.size(1)-1:] = 1 152 | output_lengths[i] = mel.size(1) 153 | speaker_ids[i] = batch[ids_sorted_decreasing[i]][2] 154 | f0 = batch[ids_sorted_decreasing[i]][3] 155 | f0_padded[i, :, :f0.size(1)] = f0 156 | 157 | model_inputs = (text_padded, input_lengths, mel_padded, gate_padded, 158 | output_lengths, speaker_ids, f0_padded) 159 | 160 | return model_inputs 161 | -------------------------------------------------------------------------------- /distributed.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.distributed as dist 3 | from torch.nn.modules import Module 4 | from torch.autograd import Variable 5 | 6 | def _flatten_dense_tensors(tensors): 7 | """Flatten dense tensors into a contiguous 1D buffer. Assume tensors are of 8 | same dense type. 9 | Since inputs are dense, the resulting tensor will be a concatenated 1D 10 | buffer. Element-wise operation on this buffer will be equivalent to 11 | operating individually. 12 | Arguments: 13 | tensors (Iterable[Tensor]): dense tensors to flatten. 14 | Returns: 15 | A contiguous 1D buffer containing input tensors. 16 | """ 17 | if len(tensors) == 1: 18 | return tensors[0].contiguous().view(-1) 19 | flat = torch.cat([t.contiguous().view(-1).float() for t in tensors], dim=0) 20 | return flat 21 | 22 | def _unflatten_dense_tensors(flat, tensors): 23 | """View a flat buffer using the sizes of tensors. Assume that tensors are of 24 | same dense type, and that flat is given by _flatten_dense_tensors. 25 | Arguments: 26 | flat (Tensor): flattened dense tensors to unflatten. 27 | tensors (Iterable[Tensor]): dense tensors whose sizes will be used to 28 | unflatten flat. 29 | Returns: 30 | Unflattened dense tensors with sizes same as tensors and values from 31 | flat. 
32 | """ 33 | outputs = [] 34 | offset = 0 35 | for tensor in tensors: 36 | numel = tensor.numel() 37 | outputs.append(flat.narrow(0, offset, numel).view_as(tensor)) 38 | offset += numel 39 | return tuple(outputs) 40 | 41 | 42 | ''' 43 | This version of DistributedDataParallel is designed to be used in conjunction with the multiproc.py 44 | launcher included with this example. It assumes that your run is using multiprocess with 1 45 | GPU/process, that the model is on the correct device, and that torch.set_device has been 46 | used to set the device. 47 | 48 | Parameters are broadcasted to the other processes on initialization of DistributedDataParallel, 49 | and will be allreduced at the finish of the backward pass. 50 | ''' 51 | class DistributedDataParallel(Module): 52 | 53 | def __init__(self, module): 54 | super(DistributedDataParallel, self).__init__() 55 | #fallback for PyTorch 0.3 56 | if not hasattr(dist, '_backend'): 57 | self.warn_on_half = True 58 | else: 59 | self.warn_on_half = True if dist._backend == dist.dist_backend.GLOO else False 60 | 61 | self.module = module 62 | 63 | for p in self.module.state_dict().values(): 64 | if not torch.is_tensor(p): 65 | continue 66 | dist.broadcast(p, 0) 67 | 68 | def allreduce_params(): 69 | if(self.needs_reduction): 70 | self.needs_reduction = False 71 | buckets = {} 72 | for param in self.module.parameters(): 73 | if param.requires_grad and param.grad is not None: 74 | tp = type(param.data) 75 | if tp not in buckets: 76 | buckets[tp] = [] 77 | buckets[tp].append(param) 78 | if self.warn_on_half: 79 | if torch.cuda.HalfTensor in buckets: 80 | print("WARNING: gloo dist backend for half parameters may be extremely slow." + 81 | " It is recommended to use the NCCL backend in this case. This currently requires" + 82 | "PyTorch built from top of tree master.") 83 | self.warn_on_half = False 84 | 85 | for tp in buckets: 86 | bucket = buckets[tp] 87 | grads = [param.grad.data for param in bucket] 88 | coalesced = _flatten_dense_tensors(grads) 89 | dist.all_reduce(coalesced) 90 | coalesced /= dist.get_world_size() 91 | for buf, synced in zip(grads, _unflatten_dense_tensors(coalesced, grads)): 92 | buf.copy_(synced) 93 | 94 | for param in list(self.module.parameters()): 95 | def allreduce_hook(*unused): 96 | param._execution_engine.queue_callback(allreduce_params) 97 | if param.requires_grad: 98 | param.register_hook(allreduce_hook) 99 | 100 | def forward(self, *inputs, **kwargs): 101 | self.needs_reduction = True 102 | return self.module(*inputs, **kwargs) 103 | 104 | ''' 105 | def _sync_buffers(self): 106 | buffers = list(self.module._all_buffers()) 107 | if len(buffers) > 0: 108 | # cross-node buffer sync 109 | flat_buffers = _flatten_dense_tensors(buffers) 110 | dist.broadcast(flat_buffers, 0) 111 | for buf, synced in zip(buffers, _unflatten_dense_tensors(flat_buffers, buffers)): 112 | buf.copy_(synced) 113 | def train(self, mode=True): 114 | # Clear NCCL communicator and CUDA event cache of the default group ID, 115 | # These cache will be recreated at the later call. This is currently a 116 | # work-around for a potential NCCL deadlock. 
117 | if dist._backend == dist.dist_backend.NCCL: 118 | dist._clear_group_cache() 119 | super(DistributedDataParallel, self).train(mode) 120 | self.module.train(mode) 121 | ''' 122 | ''' 123 | Modifies existing model to do gradient allreduce, but doesn't change class 124 | so you don't need "module" 125 | ''' 126 | def apply_gradient_allreduce(module): 127 | if not hasattr(dist, '_backend'): 128 | module.warn_on_half = True 129 | else: 130 | module.warn_on_half = True if dist._backend == dist.dist_backend.GLOO else False 131 | 132 | for p in module.state_dict().values(): 133 | if not torch.is_tensor(p): 134 | continue 135 | dist.broadcast(p, 0) 136 | 137 | def allreduce_params(): 138 | if(module.needs_reduction): 139 | module.needs_reduction = False 140 | buckets = {} 141 | for param in module.parameters(): 142 | if param.requires_grad and param.grad is not None: 143 | tp = type(param.data) 144 | if tp not in buckets: 145 | buckets[tp] = [] 146 | buckets[tp].append(param) 147 | if module.warn_on_half: 148 | if torch.cuda.HalfTensor in buckets: 149 | print("WARNING: gloo dist backend for half parameters may be extremely slow." + 150 | " It is recommended to use the NCCL backend in this case. This currently requires" + 151 | "PyTorch built from top of tree master.") 152 | module.warn_on_half = False 153 | 154 | for tp in buckets: 155 | bucket = buckets[tp] 156 | grads = [param.grad.data for param in bucket] 157 | coalesced = _flatten_dense_tensors(grads) 158 | dist.all_reduce(coalesced) 159 | coalesced /= dist.get_world_size() 160 | for buf, synced in zip(grads, _unflatten_dense_tensors(coalesced, grads)): 161 | buf.copy_(synced) 162 | 163 | for param in list(module.parameters()): 164 | def allreduce_hook(*unused): 165 | Variable._execution_engine.queue_callback(allreduce_params) 166 | if param.requires_grad: 167 | param.register_hook(allreduce_hook) 168 | 169 | def set_needs_reduction(self, input, output): 170 | self.needs_reduction = True 171 | 172 | module.register_forward_hook(set_needs_reduction) 173 | return module 174 | -------------------------------------------------------------------------------- /filelists/libritts_train_clean_100_audiopath_text_sid_atleast5min_val_filelist.txt: -------------------------------------------------------------------------------- 1 | /path_to_libritts/7367/86737/7367_86737_000132_000020.wav|There was some surprise, however, that, as he was with his face to the enemy, he should have received a ball between his shoulders. That astonishment ceased when one of the brigands remarked to his comrades that Cucumetto was stationed ten paces in Carlini's rear when he fell.|7367 2 | /path_to_libritts/7511/102419/7511_102419_000020_000000.wav|Mr. Stewart is the queerest man: instead of letting me enjoy the tableau, he solemnly drove on, saying he would not want any one gawking at him if he were the happy man.|7511 3 | /path_to_libritts/7078/271888/7078_271888_000062_000000.wav|"It is a strange fancy of mine," she explained, when I had greeted her. "I'm sure the dress is very becoming--isn't it?" 
And she waved the goblet she was holding above her head.|7078 4 | /path_to_libritts/2289/152257/2289_152257_000026_000000.wav|To the last year of his life Justinian was strong and active and a hard worker.|2289 5 | /path_to_libritts/3879/173592/3879_173592_000033_000002.wav|Friends or foes, French or Spaniards, succor or death,--betwixt these were their hopes and fears divided.|3879 6 | /path_to_libritts/6529/62554/6529_62554_000068_000000.wav|"Isn't what they have done already enough?" asked Pencroft, who did not understand these scruples.|6529 7 | /path_to_libritts/1263/141777/1263_141777_000033_000002.wav|Clumps of small trees and high growing bushes dotted that expanse, an ideal cover.|1263 8 | /path_to_libritts/200/126784/200_126784_000053_000000.wav|"I did not know it until I reached them, and it was then too late to retreat," said Wharton sullenly.|200 9 | /path_to_libritts/6529/62554/6529_62554_000062_000001.wav|"Poor Ayrton!|6529 10 | /path_to_libritts/7190/90543/7190_90543_000039_000000.wav|I had become so accustomed to Quarles jumping to some sudden conclusion that I was disappointed.|7190 11 | /path_to_libritts/7367/86737/7367_86737_000109_000001.wav|he is only two and twenty;--he will gain himself a reputation."|7367 12 | /path_to_libritts/4267/72637/4267_72637_000035_000000.wav|She withdrew entirely now, all but her hand, and her eyes sought the ground.|4267 13 | /path_to_libritts/2436/2476/2436_2476_000004_000003.wav|Then I sought out Carter.|2436 14 | /path_to_libritts/6272/70191/6272_70191_000027_000000.wav|The graveyard was a mile outside the village--a sandy plain where a few stunted pines transplanted from the woods near it struggled to keep alive.|6272 15 | /path_to_libritts/2416/152139/2416_152139_000001_000001.wav|The rest of the house was in darkness.|2416 16 | /path_to_libritts/8419/293469/8419_293469_000003_000002.wav|All the wonderful things of which he told had happened before he came to the meadow, and while he was still a young Frog.|8419 17 | /path_to_libritts/1088/134315/1088_134315_000020_000000.wav|"Is that your creed?" 
she asked quietly.|1088 18 | /path_to_libritts/7367/86737/7367_86737_000132_000001.wav|The old man remained motionless; he felt that some great and unforeseen misfortune hung over his head.|7367 19 | /path_to_libritts/8108/274318/8108_274318_000018_000002.wav|He seemed very pleased with himself, and smiled with an expression of supreme innocence.|8108 20 | /path_to_libritts/3983/5371/3983_5371_000019_000000.wav|Miss Corny paused.|3983 21 | /path_to_libritts/4406/16883/4406_16883_000024_000000.wav|THE FIFTEENTH REMOVE|4406 22 | /path_to_libritts/4640/19187/4640_19187_000038_000001.wav|He returned from his sombre eagle flight into outer darkness.|4640 23 | /path_to_libritts/1502/122615/1502_122615_000004_000001.wav|Slowly and reluctantly yielding to the necessity, he quitted the place, and mingled with the throng that hovered nigh.|1502 24 | /path_to_libritts/6209/34601/6209_34601_000068_000030.wav|I was alone.|6209 25 | /path_to_libritts/7190/90542/7190_90542_000054_000000.wav|We took rooms at a hotel in Medworth, Quarles explaining that our investigations might take some days.|7190 26 | /path_to_libritts/7226/86964/7226_86964_000007_000002.wav|The latter made a pretty picture standing out on the bold bank, backed by a number of huge stacks of golden straw.|7226 27 | /path_to_libritts/6209/34601/6209_34601_000096_000023.wav|What a hurricane!|6209 28 | /path_to_libritts/4406/16882/4406_16882_000018_000002.wav|About two hours in the night, my sweet babe like a lamb departed this life on Feb. 18, 1675.|4406 29 | /path_to_libritts/6818/68772/6818_68772_000008_000000.wav|After luncheon began the speech-making, interspersed with music by the band.|6818 30 | /path_to_libritts/5022/29411/5022_29411_000058_000009.wav|My mother whispered to me--I thanked the little mill-girl, and gave her a kiss.|5022 31 | /path_to_libritts/8088/284756/8088_284756_000185_000002.wav|Her eyes met mine and I knew that I had not misunderstood.|8088 32 | /path_to_libritts/1841/150351/1841_150351_000025_000003.wav|The figures of animals and birds carved upon it represent the mythological ancestors of the family or clan in front of whose abode the pole stands.|1841 33 | /path_to_libritts/2836/5355/2836_5355_000063_000000.wav|She made no reply.|2836 34 | /path_to_libritts/2836/5355/2836_5355_000008_000000.wav|"What do you mean by 'deal?'" he asked, settling the logs to his apparent satisfaction.|2836 35 | /path_to_libritts/8609/262281/8609_262281_000047_000002.wav|One of the girls shall go with us to-day; whichever deserves it best."|8609 36 | /path_to_libritts/7067/76048/7067_76048_000044_000006.wav|The big thing confronts us still.|7067 37 | /path_to_libritts/8324/286681/8324_286681_000008_000003.wav|When he had thought for a while, he waved his big pinching-claws and said, "It would be better for me not to tell what I think.|8324 38 | /path_to_libritts/6385/34669/6385_34669_000022_000001.wav|It was a long platform, floored and tarred, supported by a network of joists, and under which flowed the river.|6385 39 | /path_to_libritts/8238/283452/8238_283452_000022_000008.wav|And there are very few funny child-ghosts--you might almost say none, in comparison with the number of grown-ups.|8238 40 | /path_to_libritts/4018/103416/4018_103416_000018_000001.wav|But I meant them."|4018 41 | /path_to_libritts/7800/283478/7800_283478_000008_000000.wav|"Now you've gone and done it, you young rapscallions!" 
cried Isaac Chase, so excited that he could hardly control his trembling voice.|7800 42 | /path_to_libritts/1246/135815/1246_135815_000015_000000.wav|"He certainly is all right as a digger," exclaimed Peter Rabbit. "My, how he can make the sand fly!|1246 43 | /path_to_libritts/374/180299/374_180299_000013_000000.wav|"We went all over the house, and we shall have everything perfect.|374 44 | /path_to_libritts/2436/2481/2436_2481_000024_000000.wav|I stammered, "If ... if she dies ... will you flash us word?"|2436 45 | /path_to_libritts/7067/76047/7067_76047_000048_000007.wav|And there is something more to be said.|7067 46 | /path_to_libritts/1970/28415/1970_28415_000016_000001.wav|This was his Father's house and his house.|1970 47 | /path_to_libritts/4680/16026/4680_16026_000045_000003.wav|And my mother?|4680 48 | /path_to_libritts/5104/33406/5104_33406_000113_000003.wav|So of those thirty-five ships only fifteen got to Greenland.|5104 49 | /path_to_libritts/7367/86737/7367_86737_000099_000000.wav|"Well, Signor Pastrini," said Franz, "now that my companion is quieted, and you have seen how peaceful my intentions are, tell me who is this Luigi Vampa.|7367 50 | /path_to_libritts/2952/410/2952_410_000034_000001.wav|They judged him to be a hardened criminal, and his story an insult to their intelligence.|2952 51 | /path_to_libritts/1246/124550/1246_124550_000012_000000.wav|Her first acquaintances were the members of the Tincomb Methodist Church, a vast red-brick tabernacle.|1246 52 | /path_to_libritts/5393/19219/5393_19219_000019_000004.wav|The house was no less fragrant than the church; after the incense, roses.|5393 53 | /path_to_libritts/1263/138246/1263_138246_000005_000001.wav|We are bound to make the best of our new lodgings, and make ourselves comfortable.|1263 54 | /path_to_libritts/7059/88364/7059_88364_000012_000007.wav|The California photo-playwright can base his Crowd Picture upon the city-worshipping mobs of San Francisco.|7059 55 | /path_to_libritts/587/54108/587_54108_000030_000001.wav|"Then will you wrap something about you and come down to the river?"|587 56 | /path_to_libritts/7278/246956/7278_246956_000013_000000.wav|The reason of George Bascombe's absence from church that morning was, that, after an early breakfast, he had mounted Helen's mare, and set out to call on Mr. Hooker before he should have gone to church.|7278 57 | /path_to_libritts/4406/16883/4406_16883_000001_000001.wav|Connecticut, to meet with King Philip.|4406 58 | /path_to_libritts/831/130746/831_130746_000052_000001.wav|I was unhappy at home--never mind why.|831 59 | /path_to_libritts/2196/174172/2196_174172_000002_000015.wav|Then begin to talk about it.|2196 60 | /path_to_libritts/7302/86814/7302_86814_000011_000004.wav|"Let us horsewhip the fine gentleman!" 
said others.|7302 61 | /path_to_libritts/4195/186238/4195_186238_000039_000000.wav|"Why, you foolish old Uncle!|4195 62 | /path_to_libritts/1963/142393/1963_142393_000005_000010.wav|As Adam's confidence waned, his patience waned with it, and he thought he must write himself.|1963 63 | /path_to_libritts/2836/5355/2836_5355_000038_000001.wav|"How on the wrong scent?"|2836 64 | /path_to_libritts/7178/34644/7178_34644_000049_000001.wav|This message addressed to justice has been faithfully delivered by the sea."|7178 65 | /path_to_libritts/1263/138246/1263_138246_000056_000001.wav|A few instants alone separate us from an eventful moment.|1263 66 | /path_to_libritts/5678/43302/5678_43302_000015_000001.wav|Then she broke off and sat back.|5678 67 | /path_to_libritts/2836/5355/2836_5355_000065_000001.wav|"These apartments are mine now; they have been transferred into my name, and they can never again afford you accommodation.|2836 68 | /path_to_libritts/8098/278278/8098_278278_000010_000000.wav|Coonskin's opinion didn't benefit Pod much.|8098 69 | /path_to_libritts/8324/286683/8324_286683_000018_000001.wav|You're big enough, but you're just as homely as you can be.|8324 70 | /path_to_libritts/3664/178366/3664_178366_000019_000011.wav|The play of "Buffalo Bill" had a very successful run of six or eight weeks, and was afterwards produced in all the principal cities of the country, everywhere being received with genuine enthusiasm.|3664 71 | /path_to_libritts/5339/14133/5339_14133_000040_000001.wav|I'm noan comin' down again to-night.'|5339 72 | /path_to_libritts/5703/47198/5703_47198_000003_000000.wav|One hot summer day, a few months after the marriage, Juliet, returning to the consulate after a morning spent in very active exercise upon a tennis court, was met on the doorstep by Dora, the youngest of the Clarency Butchers, who was awaiting her approach in a high state of excitement.|5703 73 | /path_to_libritts/78/369/78_369_000066_000009.wav|Urged thus far, I had no choice but to adapt my nature to an element which I had willingly chosen.|78 74 | /path_to_libritts/4051/11218/4051_11218_000036_000000.wav|"It is only a sleeping potion," said the enchantress to Prince Jason. "One always finds a use for these mischievous creatures, sooner or later; so I did not wish to kill him outright.|4051 75 | /path_to_libritts/8838/298545/8838_298545_000055_000000.wav|"I am always in training."|8838 76 | /path_to_libritts/4680/16041/4680_16041_000014_000002.wav|Entering a street was like entering a cellar.|4680 77 | /path_to_libritts/3526/176653/3526_176653_000034_000000.wav|"Good Lord!" said the fisherman, startled, and then he stopped--the words were as innocent on her lips as a benediction.|3526 78 | /path_to_libritts/87/121553/87_121553_000058_000000.wav|Let him the mouth imagine of the horn That in the point beginneth of the axis Round about which the primal wheel revolves,--|87 79 | /path_to_libritts/3526/176653/3526_176653_000006_000000.wav|Her eyes fell at the ancient banter, but she lifted them straightway and stared again.|3526 80 | /path_to_libritts/8629/261139/8629_261139_000031_000000.wav|"I have made an assertion," said he, "before God and before this jury. 
To make it seem a credible one I shall have to tell my own story from the beginning.|8629 81 | /path_to_libritts/6209/34601/6209_34601_000096_000042.wav|Besides, the laws are violated.|6209 82 | /path_to_libritts/3526/176651/3526_176651_000012_000000.wav|She sat at the base of the big tree--her little sunbonnet pushed back, her arms locked about her knees, her bare feet gathered under her crimson gown and her deep eyes fixed on the smoke in the valley below. Her breath was still coming fast between her parted lips.|3526 83 | /path_to_libritts/7402/90848/7402_90848_000024_000003.wav|It was very dignified and wore tortoise-shell glasses."|7402 84 | /path_to_libritts/118/47824/118_47824_000036_000000.wav|"Good morning.|118 85 | /path_to_libritts/405/130895/405_130895_000051_000001.wav|It was white underneath and reddish on top, with big round spots of deep blue encircled in black, its hide quite smooth and ending in a double-lobed fin.|405 86 | /path_to_libritts/8088/284756/8088_284756_000001_000002.wav|Rum-runners, seeking out their hidden port with their cargo of contraband from Cuba.|8088 87 | /path_to_libritts/7190/90542/7190_90542_000002_000000.wav|My association with Christopher Quarles has, however, led to the solution of some strange mysteries, and, since my own achievements are sufficiently well known, I may confine myself to those cases which, single-handed, I should have failed to solve.|7190 88 | /path_to_libritts/5339/14134/5339_14134_000047_000001.wav|'Liking has nought to do with it.'|5339 89 | /path_to_libritts/2289/152258/2289_152258_000017_000000.wav|Mohammed was very earnest and serious.|2289 90 | /path_to_libritts/8098/275181/8098_275181_000020_000006.wav|But Tom overcame me forthwith, choked me nearly black in the face, then, in dumb show, knocked my head with a stone.|8098 91 | /path_to_libritts/6209/34599/6209_34599_000025_000003.wav|An iron knocker was attached to it.|6209 92 | /path_to_libritts/2836/5355/2836_5355_000072_000001.wav|Count them."|2836 93 | /path_to_libritts/5750/100289/5750_100289_000023_000000.wav|Bishop Whipple developed many able preachers, of whom perhaps the most accomplished was the Rev. Charles Smith Cook, of the Yankton Sioux.|5750 94 | /path_to_libritts/1970/26100/1970_26100_000068_000000.wav|"Then Mr. 
Woods wasn't here all through dinner, Jackson?"|1970 95 | /path_to_libritts/7278/91083/7278_91083_000012_000002.wav|Besides, he was never idle, he was economical, his habits were the best, and why should not such a boy succeed?|7278 96 | /path_to_libritts/4297/13006/4297_13006_000058_000000.wav|"That is nonsense, Emily.|4297 97 | /path_to_libritts/7302/86814/7302_86814_000007_000001.wav|The prisoners then approached and formed a circle.|7302 98 | /path_to_libritts/831/130746/831_130746_000022_000001.wav|"And fairish?"|831 99 | /path_to_libritts/2289/152258/2289_152258_000004_000001.wav|He always spoke the truth and never broke a promise.|2289 100 | /path_to_libritts/8419/293469/8419_293469_000010_000005.wav|Oh, how frightened I was!|8419 101 | /path_to_libritts/7178/34645/7178_34645_000042_000002.wav|The work on which he was engaged could only be expressed in these strange words--the construction of a thunderbolt.|7178 102 | /path_to_libritts/6385/34669/6385_34669_000011_000006.wav|The wolf appeared to him in a halo of light.|6385 103 | /path_to_libritts/669/129061/669_129061_000004_000002.wav|His gentleman alone took the opportunity of perusing the newspaper before he laid it by his master's desk.|669 104 | /path_to_libritts/4680/16026/4680_16026_000115_000000.wav|In the meantime she stared at them with a stern but peaceful air.|4680 105 | /path_to_libritts/6415/116629/6415_116629_000012_000001.wav|I said you'd better take your hands out of your pockets, and then your earnings would run in.|6415 106 | /path_to_libritts/8123/275209/8123_275209_000008_000002.wav|How should a poor crawling creature like me know what to do without asking my betters?"|8123 107 | /path_to_libritts/1088/134315/1088_134315_000111_000001.wav|It was a large safe of the usual type.|1088 108 | /path_to_libritts/696/93314/696_93314_000049_000000.wav|"I'm not likely to have forgotten you," said the Lover.|696 109 | /path_to_libritts/3983/5371/3983_5371_000047_000000.wav|"What did she go into hysterics for?" again snapped Miss Carlyle.|3983 110 | /path_to_libritts/3486/166424/3486_166424_000067_000000.wav|Within the black background of the fissure stood a shape, an apparition, a woman--beautiful, awesome, incredible!|3486 111 | /path_to_libritts/730/358/730_358_000004_000004.wav|Sometimes I tried to imitate the pleasant songs of the birds but was unable. Sometimes I wished to express my sensations in my own mode, but the uncouth and inarticulate sounds which broke from me frightened me into silence again.|730 112 | /path_to_libritts/6415/111615/6415_111615_000022_000003.wav|Poor soul, he is like one condemned to harangue the vast, idiotic world through a keyhole, whence his anguish issues thin and faint.|6415 113 | /path_to_libritts/8088/284756/8088_284756_000169_000000.wav|They searched ceaselessly for something, and I guessed that something was food.|8088 114 | /path_to_libritts/8088/284756/8088_284756_000181_000002.wav|The man and the woman came up, and I looked closely into their faces.|8088 115 | /path_to_libritts/7505/83618/7505_83618_000009_000010.wav|King Manco, however, was a real character, the Rudolph of Hapsburg of their reigning family, and flourished about the eleventh century.|7505 116 | /path_to_libritts/4406/16883/4406_16883_000001_000013.wav|When I came ashore, they gathered all about me, I sitting alone in the midst.|4406 117 | /path_to_libritts/5022/29411/5022_29411_000020_000000.wav|"But, Mr. 
Toller," I objected, "something must have happened to distress her.|5022 118 | /path_to_libritts/7190/90542/7190_90542_000113_000000.wav|His calmness almost exasperated me, but he would answer no questions until we had returned to our hotel and had breakfast.|7190 119 | /path_to_libritts/405/130894/405_130894_000046_000000.wav|"Professor Aronnax," he told me, "this calls for heroic measures, or we'll be sealed up in this solidified water as if it were cement."|405 120 | /path_to_libritts/7302/86815/7302_86815_000008_000001.wav|The Judge.|7302 121 | /path_to_libritts/7367/86737/7367_86737_000130_000002.wav|Cucumetto fancied for a moment the young man was about to take her in his arms and fly; but this mattered little to him now Rita had been his; and as for the money, three hundred piastres distributed among the band was so small a sum that he cared little about it.|7367 122 | /path_to_libritts/125/121124/125_121124_000021_000001.wav|But whom are you seeking, Debray?"|125 123 | /path_to_libritts/2092/145706/2092_145706_000054_000000.wav|Quick as lightning the wolf flew round the wood, and in a minute many hundred wolves rose up before him, increasing in number every moment, till they could be counted by thousands.|2092 124 | /path_to_libritts/8238/274553/8238_274553_000018_000000.wav|Your most humble servant,|8238 125 | /path_to_libritts/2836/5355/2836_5355_000085_000000.wav|"Don't I know it?|2836 126 | /path_to_libritts/587/41619/587_41619_000035_000002.wav|Besides, Miss Pierson is too short.|587 127 | /path_to_libritts/6272/70171/6272_70171_000011_000002.wav|In the morning, she said, we should see her three children. She never left them, she was so afraid of their being ill, also telling mother that she would do all in her power to make my stay in Rosville pleasant and profitable.|6272 128 | /path_to_libritts/2436/2481/2436_2481_000005_000001.wav|Both were closed.|2436 129 | /path_to_libritts/2911/7601/2911_7601_000009_000004.wav|I say I knew it well. I knew what the old man felt, and pitied him, although I chuckled at heart.|2911 130 | /path_to_libritts/6064/56165/6064_56165_000059_000000.wav|"She's invited to my cooking party next week," said Nora.|6064 131 | /path_to_libritts/6272/70168/6272_70168_000025_000000.wav|A smile crept into her blue eye, as she said: "My hearing him, or not, would make no difference, since God could hear and answer."|6272 132 | /path_to_libritts/1502/122615/1502_122615_000008_000001.wav|He would greatly have preferred silence and meditation to speech, when a discovery of his real condition might prove so instantly fatal.|1502 133 | /path_to_libritts/3983/5371/3983_5371_000025_000003.wav|What on earth had put him into that state?|3983 134 | /path_to_libritts/6848/76049/6848_76049_000005_000012.wav|Old Grammont had struck the table sharply and the eyes that looked out of his mask had blazed.|6848 135 | /path_to_libritts/460/172359/460_172359_000049_000001.wav|It was all over the town in a minutes.|460 136 | /path_to_libritts/8838/298545/8838_298545_000053_000000.wav|"I said that he was a welter weight."|8838 137 | /path_to_libritts/3983/5331/3983_5331_000042_000002.wav|"Have you told me all?" 
he asked presently, lifting them.|3983 138 | /path_to_libritts/5322/7680/5322_7680_000047_000001.wav|If you don't you will offend me.|5322 139 | /path_to_libritts/2196/170379/2196_170379_000014_000001.wav|This is a legitimate use of regression although it is not used so much these days to uncover past traumatic incidents.|2196 140 | /path_to_libritts/118/47824/118_47824_000101_000000.wav|"I won't blame Carlos for that," Bobby muttered.|118 141 | /path_to_libritts/8088/284756/8088_284756_000135_000001.wav|Something she had drawn from her girdle shone palely in her hand.|8088 142 | /path_to_libritts/6415/116629/6415_116629_000036_000007.wav|Dietrich was walking in steep and dangerous paths; that she was sure of, but he knew the straight road and would not his steps turn back to it again?|6415 143 | /path_to_libritts/2136/5143/2136_5143_000055_000002.wav|Let us talk of something else.'|2136 144 | /path_to_libritts/1246/124548/1246_124548_000020_000001.wav|You take a genuwine, honest-to-God homo Americanibus and there ain't anything he's afraid to tackle.|1246 145 | /path_to_libritts/5022/29405/5022_29405_000021_000003.wav|We will drive out after luncheon, and pay a round of visits." When this prospect was placed before me, I remembered having read in books of sensitive persons receiving impressions which made their blood run cold; I now found myself one of those persons, for the first time in my life.|5022 146 | /path_to_libritts/40/222/40_222_000011_000007.wav|She drew back, trying to beg their pardon, but was, with gentle violence, forced to return; and the others withdrew, after Eleanor had affectionately expressed a wish of being of use or comfort to her.|40 147 | /path_to_libritts/2836/5354/2836_5354_000056_000001.wav|"I cannot help the delay."|2836 148 | /path_to_libritts/887/123289/887_123289_000033_000000.wav|"Yes; to be sure we have.|887 149 | /path_to_libritts/7447/91186/7447_91186_000027_000005.wav|Spontaneous as was his creative power he was most painstaking in regard to the setting of his musical ideas and would often devote weeks to re-writing a single page that every detail might be perfect.|7447 150 | /path_to_libritts/8238/283452/8238_283452_000003_000000.wav|HUMOROUS GHOST STORIES|8238 151 | /path_to_libritts/7447/91187/7447_91187_000018_000000.wav|In regard to the much discussed tempo rubato of Chopin many and fatal blunders have been made.|7447 152 | /path_to_libritts/4018/103416/4018_103416_000070_000000.wav|"Heavens, how late it is!" 
she exclaimed.|4018 153 | /path_to_libritts/7067/76048/7067_76048_000064_000004.wav|They don't seem to know what they are doing.|7067 154 | /path_to_libritts/8838/298546/8838_298546_000013_000000.wav|The sheet of the paper which he held up was a lake of print around an islet of illustration.|8838 155 | /path_to_libritts/4297/13006/4297_13006_000049_000002.wav|He is well educated;--oh, so much better than most men that one meets.|4297 156 | /path_to_libritts/6415/116629/6415_116629_000036_000009.wav|She recalled the evening of the day when her husband was borne from the house to his burial.|6415 157 | /path_to_libritts/8088/284756/8088_284756_000167_000000.wav|There was a gray haze of mist everywhere.|8088 158 | /path_to_libritts/3879/174923/3879_174923_000019_000002.wav|Now Lord Chiltern was again his very intimate friend.|3879 159 | /path_to_libritts/5750/100289/5750_100289_000039_000004.wav|Or should we keep clear of these matters, avoid discussion of official methods and action, and simply aim at arousing racial pride and ambition along new lines, holding up a modern ideal for the support and encouragement of our youth?|5750 160 | /path_to_libritts/2136/5143/2136_5143_000018_000000.wav|'Poor mamma is buried there.'|2136 161 | /path_to_libritts/8629/261139/8629_261139_000033_000004.wav|Still, this would not seem to be reason enough for me to intrude upon her late at night with a plea for a large loan of money, had I not been in a desperate condition of mind, which made any attempt seem reasonable that promised relief from the unendurable burden of a pressing and disreputable debt.|8629 162 | /path_to_libritts/1963/147036/1963_147036_000017_000002.wav|She looked up at his words.|1963 163 | /path_to_libritts/8419/286676/8419_286676_000018_000001.wav|"I think the Ducks spoil their children," said she.|8419 164 | /path_to_libritts/2092/145706/2092_145706_000040_000001.wav|Before to-morrow night all the grain in the kingdom has to be gathered into one big heap, and if as much as a stalk of corn is wanting I must pay for it with my life.'|2092 165 | /path_to_libritts/2416/152139/2416_152139_000083_000000.wav|An instant Jimmie Dale watched the other, then he picked up the sheet of paper.|2416 166 | /path_to_libritts/669/129074/669_129074_000024_000007.wav|Isn't the whole course of life made up of such?|669 167 | /path_to_libritts/7402/90848/7402_90848_000060_000001.wav|He had arrived half an hour late, but he could have that half-hour back again!|7402 168 | /path_to_libritts/8088/284756/8088_284756_000163_000002.wav|One of the occupants of the room was a very old man; his face was wrinkled, and his hair was silvery.|8088 169 | /path_to_libritts/2136/5147/2136_5147_000051_000000.wav|But it was impossible long to be vexed with Cousin Monica.|2136 170 | /path_to_libritts/6818/68772/6818_68772_000026_000002.wav|Forbes, who never earned a dollar in his life, but inherited his money, is trying to take the dollars out of the pockets of the farmers by depriving them of the income derived by selling spaces for advertising signs.|6818 171 | /path_to_libritts/3664/11714/3664_11714_000020_000003.wav|Yet, although hitherto he had bowed his head before the authority of the Church, he had already raised it against the temporal power.|3664 172 | /path_to_libritts/78/369/78_369_000017_000001.wav|I know not whether the fiend possessed the same advantages, but I found that, as before I had daily lost ground in the pursuit, I now gained on him, so much so that when I first saw the ocean he was but one day's 
journey in advance, and I hoped to intercept him before he should reach the beach.|78 173 | /path_to_libritts/4195/186238/4195_186238_000011_000000.wav|Another hour passed.|4195 174 | /path_to_libritts/8088/284756/8088_284756_000077_000003.wav|The apparatus is strewn all over the place."|8088 175 | /path_to_libritts/8088/284756/8088_284756_000055_000005.wav|Then she saw the pool.|8088 176 | /path_to_libritts/8088/284756/8088_284756_000085_000000.wav|"Yes, that's very true: Carson is a most decent sort of chap." The words were not spoken.|8088 177 | /path_to_libritts/4195/186236/4195_186236_000009_000000.wav|"Good-by to my five thousand," said Uncle John, with his chuckling laugh.|4195 178 | /path_to_libritts/4160/14187/4160_14187_000035_000001.wav|"A sound thinker gives equal consideration to the probable and the improbable."|4160 179 | /path_to_libritts/4297/13009/4297_13009_000003_000003.wav|They were habitually indifferent to self-exaltation, and allowed themselves to be thrust into this or that unfitting role, professing that the Queen's Government and the good of the country were their only considerations.|4297 180 | /path_to_libritts/1116/132847/1116_132847_000014_000003.wav|The stick I shall keep for myself, so that I can fly to you if ever you have need of me.'|1116 -------------------------------------------------------------------------------- /filelists/ljs_audiopaths_text_sid_val_filelist.txt: -------------------------------------------------------------------------------- 1 | /path_to_ljs/LJ045-0227.wav|and suggests the possibility, as did his note to his wife just prior to the attempt on General Walker, that he did not expect to escape at all.|0 2 | /path_to_ljs/LJ043-0073.wav|On October nine, nineteen sixty-two he went to the Dallas office of the Texas Employment Commission|0 3 | /path_to_ljs/LJ021-0054.wav|that the way to wealth is through work.|0 4 | /path_to_ljs/LJ036-0134.wav|I asked him where he wanted to go. And he said, "five hundred North Beckley. Well, I started up,|0 5 | /path_to_ljs/LJ045-0069.wav|Katherine Ford, with whom Marina Oswald stayed during her separation from her husband in November of nineteen sixty-two,|0 6 | /path_to_ljs/LJ006-0134.wav|The governor himself admitted that a prisoner of weak intellect who had been severely beaten and much injured by a wardsman did not dare complain|0 7 | /path_to_ljs/LJ047-0168.wav|and that he had given as his address Mrs. Paine's residence in Irving.|0 8 | /path_to_ljs/LJ035-0136.wav|Lovelady and Shelley moved out into the street.|0 9 | /path_to_ljs/LJ028-0219.wav|All the people of Babylon prostrated themselves before him, and, kissing his feet, rejoiced in his sovereignty, while happiness shone on their faces.|0 10 | /path_to_ljs/LJ032-0108.wav|Oswald listed a "Sgt. 
Robert Hidell" as a reference on one job application and "George Hidell" as a reference on another.|0 11 | /path_to_ljs/LJ009-0015.wav|I will quote an extract from the reverend gentleman's own journal.|0 12 | /path_to_ljs/LJ036-0154.wav|The walk from Beckley and Neely to ten twenty-six North Beckley was timed by Commission counsel at five minutes and forty-five seconds.|0 13 | /path_to_ljs/LJ048-0055.wav|a more alert and carefully considered treatment of the Oswald case by the Bureau might have brought about such a referral.|0 14 | /path_to_ljs/LJ013-0063.wav|The other, remaining unclaimed for ten years, was transferred at the end of that time to the commissioners for the reduction of the National Debt.|0 15 | /path_to_ljs/LJ033-0207.wav|In light of the other evidence linking Lee Harvey Oswald, the blanket, and the rifle to the paper bag found on the sixth floor,|0 16 | /path_to_ljs/LJ037-0025.wav|Benavides saw a man standing at the right side of the parked police car. He then heard three shots and saw the policeman fall to the ground.|0 17 | /path_to_ljs/LJ019-0320.wav|was the boon to which willing industry extending over a long period established a certain claim.|0 18 | /path_to_ljs/LJ041-0095.wav|Oswald read a good deal, said Powers, but, quote, he would never be reading any of the shoot-em-up westerns or anything like that.|0 19 | /path_to_ljs/LJ016-0048.wav|a distance of eight or nine feet.|0 20 | /path_to_ljs/LJ027-0026.wav|lead to a preview of certain principles of adaptation, necessary for their interpretation.|0 21 | /path_to_ljs/LJ024-0127.wav|You who know me can have no fear that I would tolerate the destruction by any branch of government of any part of our heritage of freedom.|0 22 | /path_to_ljs/LJ002-0241.wav|The court of the Marshalsea was instituted by Charles the first in the sixth year of his reign,|0 23 | /path_to_ljs/LJ002-0314.wav|was after eighteen oh seven, through the exertions of the keeper of the jail, spent in the purchase of necessaries.|0 24 | /path_to_ljs/LJ004-0221.wav|The food was properly prepared in the prison kitchen.|0 25 | /path_to_ljs/LJ024-0065.wav|Such a succession of appointments should have provided a Court well-balanced as to age.|0 26 | /path_to_ljs/LJ003-0030.wav|These subordinate chiefs were also rewarded out of the scanty prison rations.|0 27 | /path_to_ljs/LJ042-0075.wav|testified that Oswald was extremely sure of himself and seemed, quote, to know what his mission was.|0 28 | /path_to_ljs/LJ018-0059.wav|He saw Mr. Briggs' watch-chain, and followed him instantly into the carriage, determined to have it at all costs.|0 29 | /path_to_ljs/LJ015-0002.wav|The course of the swindlers was by no means smooth, but it was not till eighteen fifty-four that suspicion arose that anything was wrong.|0 30 | /path_to_ljs/LJ049-0177.wav|and Robert I. Bouck, who was in charge of the Protective Research Section of the Secret Service, believed that the accumulation of the facts known to the FBI|0 31 | /path_to_ljs/LJ044-0028.wav|She testified that he threatened to beat her if she did not do so. 
The chapter had never been chartered by the national FPCC organization.|0 32 | /path_to_ljs/LJ022-0185.wav|The answer to this demand was the Federal Reserve System.|0 33 | /path_to_ljs/LJ010-0303.wav|According to Fauntleroy's own case, he found at once that the firm was heavily involved,|0 34 | /path_to_ljs/LJ030-0236.wav|I quickly observed unnatural movement of crowds, like ducking or scattering, and quick movements in the Presidential follow-up car.|0 35 | /path_to_ljs/LJ018-0204.wav|Where it had lain was a yawning gulf or trap sufficient to do for the whole body of police engaged in the capture.|0 36 | /path_to_ljs/LJ024-0058.wav|or make independent on upon the desire or prejudice of any individual justice?|0 37 | /path_to_ljs/LJ023-0130.wav|was to infuse new blood into all our courts.|0 38 | /path_to_ljs/LJ001-0101.wav|It is discouraging to note that the improvement of the last fifty years is almost wholly confined to Great Britain.|0 39 | /path_to_ljs/LJ020-0061.wav|Break the rolls apart from one another and eat warm. They are also good cold, and if the directions be followed implicitly, very good always.|0 40 | /path_to_ljs/LJ005-0118.wav|In many others there were no infirmaries, no places set apart for the confinement of prisoners afflicted with dangerous and infectious disorders.|0 41 | /path_to_ljs/LJ047-0072.wav|stating that he had distributed its pamphlets on the streets of Dallas. This information did not reach Agent Hosty in Dallas until June.|0 42 | /path_to_ljs/LJ050-0156.wav|this money would be used to compensate consultants, to lease standard equipment or to purchase specially designed pilot equipment.|0 43 | /path_to_ljs/LJ018-0040.wav|His trial followed at the next sessions of the Central Criminal Court, and ended in his conviction.|0 44 | /path_to_ljs/LJ030-0123.wav|Special Agent Glen A. Bennett once left his place inside the follow-up car to help keep the crowd away from the President's car.|0 45 | /path_to_ljs/LJ018-0336.wav|Webster's devices for disposing of the body of her victim will call to mind those of Theodore Gardelle,|0 46 | /path_to_ljs/LJ028-0299.wav|"Had I told thee," rejoined the other, "what I was bent on doing, thou wouldst not have suffered it;|0 47 | /path_to_ljs/LJ040-0013.wav|Oswald's complete state of mind and character are now outside of the power of man to know.|0 48 | /path_to_ljs/LJ013-0120.wav|whom he had shown over the plate closet.|0 49 | /path_to_ljs/LJ018-0247.wav|William Roupell himself was brought as a principal witness to clench the case by a confession altogether against himself.|0 50 | /path_to_ljs/LJ015-0106.wav|A reward was forthwith offered for Robson's apprehension.|0 51 | /path_to_ljs/LJ030-0050.wav|The Presidential limousine.|0 52 | /path_to_ljs/LJ031-0020.wav|two rooms were prepared.|0 53 | /path_to_ljs/LJ008-0281.wav|They found at Newgate, under disgraceful conditions as already described,|0 54 | /path_to_ljs/LJ002-0048.wav|two. 
The female debtors' side consisted of a court-yard forty-nine by sixteen feet,|0 55 | /path_to_ljs/LJ017-0085.wav|Palmer's plan was to administer poison in quantities insufficient to cause death, but enough to produce illness which would account for death.|0 56 | /path_to_ljs/LJ045-0111.wav|They asked for Lee Oswald who was not called to the telephone because he was known by the other name.|0 57 | /path_to_ljs/LJ027-0144.wav|Now, it must be obvious|0 58 | /path_to_ljs/LJ040-0066.wav|It had its effect on Lee's mother, Marguerite, his brother Robert, who had been born in nineteen thirty-four,|0 -------------------------------------------------------------------------------- /fp16_optimizer.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | from torch.autograd import Variable 4 | from torch.nn.parameter import Parameter 5 | from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors 6 | 7 | from loss_scaler import DynamicLossScaler, LossScaler 8 | 9 | FLOAT_TYPES = (torch.FloatTensor, torch.cuda.FloatTensor) 10 | HALF_TYPES = (torch.HalfTensor, torch.cuda.HalfTensor) 11 | 12 | def conversion_helper(val, conversion): 13 | """Apply conversion to val. Recursively apply conversion if `val` is a nested tuple/list structure.""" 14 | if not isinstance(val, (tuple, list)): 15 | return conversion(val) 16 | rtn = [conversion_helper(v, conversion) for v in val] 17 | if isinstance(val, tuple): 18 | rtn = tuple(rtn) 19 | return rtn 20 | 21 | def fp32_to_fp16(val): 22 | """Convert fp32 `val` to fp16""" 23 | def half_conversion(val): 24 | val_typecheck = val 25 | if isinstance(val_typecheck, (Parameter, Variable)): 26 | val_typecheck = val.data 27 | if isinstance(val_typecheck, FLOAT_TYPES): 28 | val = val.half() 29 | return val 30 | return conversion_helper(val, half_conversion) 31 | 32 | def fp16_to_fp32(val): 33 | """Convert fp16 `val` to fp32""" 34 | def float_conversion(val): 35 | val_typecheck = val 36 | if isinstance(val_typecheck, (Parameter, Variable)): 37 | val_typecheck = val.data 38 | if isinstance(val_typecheck, HALF_TYPES): 39 | val = val.float() 40 | return val 41 | return conversion_helper(val, float_conversion) 42 | 43 | class FP16_Module(nn.Module): 44 | def __init__(self, module): 45 | super(FP16_Module, self).__init__() 46 | self.add_module('module', module.half()) 47 | 48 | def forward(self, *inputs, **kwargs): 49 | return fp16_to_fp32(self.module(*(fp32_to_fp16(inputs)), **kwargs)) 50 | 51 | class FP16_Optimizer(object): 52 | """ 53 | FP16_Optimizer is designed to wrap an existing PyTorch optimizer, 54 | and enable an fp16 model to be trained using a master copy of fp32 weights. 55 | 56 | Args: 57 | optimizer (torch.optim.optimizer): Existing optimizer containing initialized fp16 parameters. Internally, FP16_Optimizer replaces the passed optimizer's fp16 parameters with new fp32 parameters copied from the original ones. FP16_Optimizer also stores references to the original fp16 parameters, and updates these fp16 parameters from the master fp32 copy after each step. 58 | static_loss_scale (float, optional, default=1.0): Loss scale used internally to scale fp16 gradients computed by the model. Scaled gradients will be copied to fp32, then downscaled before being applied to the fp32 master params, so static_loss_scale should not affect learning rate. 59 | dynamic_loss_scale (bool, optional, default=False): Use dynamic loss scaling. If True, this will override any static_loss_scale option. 
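        Example (a minimal end-to-end sketch; ``MyNet``, ``loss_fn`` and ``loader`` are
        illustrative assumptions, not part of this repo)::

            model = FP16_Module(MyNet()).cuda()          # MyNet: hypothetical fp32 network
            optimizer = FP16_Optimizer(
                torch.optim.SGD(model.parameters(), lr=1e-3),
                dynamic_loss_scale=True)

            for input, target in loader:                 # loader: hypothetical CUDA data loader
                optimizer.zero_grad()
                loss = loss_fn(model(input), target)     # loss_fn: hypothetical criterion
                optimizer.backward(loss)                 # replaces loss.backward()
                optimizer.clip_fp32_grads(clip=1.0)      # optional fp32 grad clipping
                optimizer.step()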
60 | 61 | """ 62 | 63 | def __init__(self, optimizer, static_loss_scale=1.0, dynamic_loss_scale=False): 64 | if not torch.cuda.is_available: 65 | raise SystemError('Cannot use fp16 without CUDA') 66 | 67 | self.fp16_param_groups = [] 68 | self.fp32_param_groups = [] 69 | self.fp32_flattened_groups = [] 70 | for i, param_group in enumerate(optimizer.param_groups): 71 | print("FP16_Optimizer processing param group {}:".format(i)) 72 | fp16_params_this_group = [] 73 | fp32_params_this_group = [] 74 | for param in param_group['params']: 75 | if param.requires_grad: 76 | if param.type() == 'torch.cuda.HalfTensor': 77 | print("FP16_Optimizer received torch.cuda.HalfTensor with {}" 78 | .format(param.size())) 79 | fp16_params_this_group.append(param) 80 | elif param.type() == 'torch.cuda.FloatTensor': 81 | print("FP16_Optimizer received torch.cuda.FloatTensor with {}" 82 | .format(param.size())) 83 | fp32_params_this_group.append(param) 84 | else: 85 | raise TypeError("Wrapped parameters must be either " 86 | "torch.cuda.FloatTensor or torch.cuda.HalfTensor. " 87 | "Received {}".format(param.type())) 88 | 89 | fp32_flattened_this_group = None 90 | if len(fp16_params_this_group) > 0: 91 | fp32_flattened_this_group = _flatten_dense_tensors( 92 | [param.detach().data.clone().float() for param in fp16_params_this_group]) 93 | 94 | fp32_flattened_this_group = Variable(fp32_flattened_this_group, requires_grad = True) 95 | 96 | fp32_flattened_this_group.grad = fp32_flattened_this_group.new( 97 | *fp32_flattened_this_group.size()) 98 | 99 | # python's lovely list concatenation via + 100 | if fp32_flattened_this_group is not None: 101 | param_group['params'] = [fp32_flattened_this_group] + fp32_params_this_group 102 | else: 103 | param_group['params'] = fp32_params_this_group 104 | 105 | self.fp16_param_groups.append(fp16_params_this_group) 106 | self.fp32_param_groups.append(fp32_params_this_group) 107 | self.fp32_flattened_groups.append(fp32_flattened_this_group) 108 | 109 | # print("self.fp32_flattened_groups = ", self.fp32_flattened_groups) 110 | # print("self.fp16_param_groups = ", self.fp16_param_groups) 111 | 112 | self.optimizer = optimizer.__class__(optimizer.param_groups) 113 | 114 | # self.optimizer.load_state_dict(optimizer.state_dict()) 115 | 116 | self.param_groups = self.optimizer.param_groups 117 | 118 | if dynamic_loss_scale: 119 | self.dynamic_loss_scale = True 120 | self.loss_scaler = DynamicLossScaler() 121 | else: 122 | self.dynamic_loss_scale = False 123 | self.loss_scaler = LossScaler(static_loss_scale) 124 | 125 | self.overflow = False 126 | self.first_closure_call_this_step = True 127 | 128 | def zero_grad(self): 129 | """ 130 | Zero fp32 and fp16 parameter grads. 131 | """ 132 | self.optimizer.zero_grad() 133 | for fp16_group in self.fp16_param_groups: 134 | for param in fp16_group: 135 | if param.grad is not None: 136 | param.grad.detach_() # This does appear in torch.optim.optimizer.zero_grad(), 137 | # but I'm not sure why it's needed. 
138 | param.grad.zero_() 139 | 140 | def _check_overflow(self): 141 | params = [] 142 | for group in self.fp16_param_groups: 143 | for param in group: 144 | params.append(param) 145 | for group in self.fp32_param_groups: 146 | for param in group: 147 | params.append(param) 148 | self.overflow = self.loss_scaler.has_overflow(params) 149 | 150 | def _update_scale(self, has_overflow=False): 151 | self.loss_scaler.update_scale(has_overflow) 152 | 153 | def _copy_grads_fp16_to_fp32(self): 154 | for fp32_group, fp16_group in zip(self.fp32_flattened_groups, self.fp16_param_groups): 155 | if len(fp16_group) > 0: 156 | # This might incur one more deep copy than is necessary. 157 | fp32_group.grad.data.copy_( 158 | _flatten_dense_tensors([fp16_param.grad.data for fp16_param in fp16_group])) 159 | 160 | def _downscale_fp32(self): 161 | if self.loss_scale != 1.0: 162 | for param_group in self.optimizer.param_groups: 163 | for param in param_group['params']: 164 | param.grad.data.mul_(1./self.loss_scale) 165 | 166 | def clip_fp32_grads(self, clip=-1): 167 | if not self.overflow: 168 | fp32_params = [] 169 | for param_group in self.optimizer.param_groups: 170 | for param in param_group['params']: 171 | fp32_params.append(param) 172 | if clip > 0: 173 | return torch.nn.utils.clip_grad_norm(fp32_params, clip) 174 | 175 | def _copy_params_fp32_to_fp16(self): 176 | for fp16_group, fp32_group in zip(self.fp16_param_groups, self.fp32_flattened_groups): 177 | if len(fp16_group) > 0: 178 | for fp16_param, fp32_data in zip(fp16_group, 179 | _unflatten_dense_tensors(fp32_group.data, fp16_group)): 180 | fp16_param.data.copy_(fp32_data) 181 | 182 | def state_dict(self): 183 | """ 184 | Returns a dict containing the current state of this FP16_Optimizer instance. 185 | This dict contains attributes of FP16_Optimizer, as well as the state_dict 186 | of the contained Pytorch optimizer. 187 | 188 | Untested. 189 | """ 190 | state_dict = {} 191 | state_dict['loss_scaler'] = self.loss_scaler 192 | state_dict['dynamic_loss_scale'] = self.dynamic_loss_scale 193 | state_dict['overflow'] = self.overflow 194 | state_dict['first_closure_call_this_step'] = self.first_closure_call_this_step 195 | state_dict['optimizer_state_dict'] = self.optimizer.state_dict() 196 | return state_dict 197 | 198 | def load_state_dict(self, state_dict): 199 | """ 200 | Loads a state_dict created by an earlier call to state_dict. 201 | 202 | Untested. 203 | """ 204 | self.loss_scaler = state_dict['loss_scaler'] 205 | self.dynamic_loss_scale = state_dict['dynamic_loss_scale'] 206 | self.overflow = state_dict['overflow'] 207 | self.first_closure_call_this_step = state_dict['first_closure_call_this_step'] 208 | self.optimizer.load_state_dict(state_dict['optimizer_state_dict']) 209 | 210 | def step(self, closure=None): # could add clip option. 211 | """ 212 | If no closure is supplied, step should be called after fp16_optimizer_obj.backward(loss). 213 | step updates the fp32 master copy of parameters using the optimizer supplied to 214 | FP16_Optimizer's constructor, then copies the updated fp32 params into the fp16 params 215 | originally referenced by Fp16_Optimizer's constructor, so the user may immediately run 216 | another forward pass using their model. 217 | 218 | If a closure is supplied, step may be called without a prior call to self.backward(loss). 219 | However, the user should take care that any loss.backward() call within the closure 220 | has been replaced by fp16_optimizer_obj.backward(loss). 
221 | 222 | Args: 223 | closure (optional): Closure that will be supplied to the underlying optimizer originally passed to FP16_Optimizer's constructor. closure should call zero_grad on the FP16_Optimizer object, compute the loss, call .backward(loss), and return the loss. 224 | 225 | Closure example:: 226 | 227 | # optimizer is assumed to be an FP16_Optimizer object, previously constructed from an 228 | # existing pytorch optimizer. 229 | for input, target in dataset: 230 | def closure(): 231 | optimizer.zero_grad() 232 | output = model(input) 233 | loss = loss_fn(output, target) 234 | optimizer.backward(loss) 235 | return loss 236 | optimizer.step(closure) 237 | 238 | .. note:: 239 | The only changes that need to be made compared to 240 | `ordinary optimizer closures`_ are that "optimizer" itself should be an instance of 241 | FP16_Optimizer, and that the call to loss.backward should be replaced by 242 | optimizer.backward(loss). 243 | 244 | .. warning:: 245 | Currently, calling step with a closure is not compatible with dynamic loss scaling. 246 | 247 | .. _`ordinary optimizer closures`: 248 | http://pytorch.org/docs/master/optim.html#optimizer-step-closure 249 | """ 250 | if closure is not None and isinstance(self.loss_scaler, DynamicLossScaler): 251 | raise TypeError("Using step with a closure is currently not " 252 | "compatible with dynamic loss scaling.") 253 | 254 | scale = self.loss_scaler.loss_scale 255 | self._update_scale(self.overflow) 256 | 257 | if self.overflow: 258 | print("OVERFLOW! Skipping step. Attempted loss scale: {}".format(scale)) 259 | return 260 | 261 | if closure is not None: 262 | self._step_with_closure(closure) 263 | else: 264 | self.optimizer.step() 265 | 266 | self._copy_params_fp32_to_fp16() 267 | 268 | return 269 | 270 | def _step_with_closure(self, closure): 271 | def wrapped_closure(): 272 | if self.first_closure_call_this_step: 273 | """ 274 | We expect that the fp16 params are initially fresh on entering self.step(), 275 | so _copy_params_fp32_to_fp16() is unnecessary the first time wrapped_closure() 276 | is called within self.optimizer.step(). 277 | """ 278 | self.first_closure_call_this_step = False 279 | else: 280 | """ 281 | If self.optimizer.step() internally calls wrapped_closure more than once, 282 | it may update the fp32 params after each call. However, self.optimizer 283 | doesn't know about the fp16 params at all. If the fp32 params get updated, 284 | we can't rely on self.optimizer to refresh the fp16 params. We need 285 | to handle that manually: 286 | """ 287 | self._copy_params_fp32_to_fp16() 288 | 289 | """ 290 | Our API expects the user to give us ownership of the backward() call by 291 | replacing all calls to loss.backward() with optimizer.backward(loss). 292 | This requirement holds whether or not the call to backward() is made within 293 | a closure. 294 | If the user is properly calling optimizer.backward(loss) within "closure," 295 | calling closure() here will give the fp32 master params fresh gradients 296 | for the optimizer to play with, 297 | so all wrapped_closure needs to do is call closure() and return the loss. 
298 | """ 299 | temp_loss = closure() 300 | return temp_loss 301 | 302 | self.optimizer.step(wrapped_closure) 303 | 304 | self.first_closure_call_this_step = True 305 | 306 | def backward(self, loss, update_fp32_grads=True): 307 | """ 308 | fp16_optimizer_obj.backward performs the following conceptual operations: 309 | 310 | fp32_loss = loss.float() (see first Note below) 311 | 312 | scaled_loss = fp32_loss*loss_scale 313 | 314 | scaled_loss.backward(), which accumulates scaled gradients into the .grad attributes of the 315 | fp16 model's leaves. 316 | 317 | fp16 grads are then copied to the stored fp32 params' .grad attributes (see second Note). 318 | 319 | Finally, fp32 grads are divided by loss_scale. 320 | 321 | In this way, after fp16_optimizer_obj.backward, the fp32 parameters have fresh gradients, 322 | and fp16_optimizer_obj.step may be called. 323 | 324 | .. note:: 325 | Converting the loss to fp32 before applying the loss scale provides some 326 | additional safety against overflow if the user has supplied an fp16 value. 327 | However, for maximum overflow safety, the user should 328 | compute the loss criterion (MSE, cross entropy, etc) in fp32 before supplying it to 329 | fp16_optimizer_obj.backward. 330 | 331 | .. note:: 332 | The gradients found in an fp16 model's leaves after a call to 333 | fp16_optimizer_obj.backward should not be regarded as valid in general, 334 | because it's possible 335 | they have been scaled (and in the case of dynamic loss scaling, 336 | the scale factor may silently change over time). 337 | If the user wants to inspect gradients after a call to fp16_optimizer_obj.backward, 338 | he/she should query the .grad attribute of FP16_Optimizer's stored fp32 parameters. 339 | 340 | Args: 341 | loss: The loss output by the user's model. loss may be either float or half (but see first Note above). 342 | update_fp32_grads (bool, optional, default=True): Option to copy fp16 grads to fp32 grads on this call. By setting this to False, the user can delay this copy, which is useful to eliminate redundant fp16->fp32 grad copies if fp16_optimizer_obj.backward is being called on multiple losses in one iteration. If set to False, the user becomes responsible for calling fp16_optimizer_obj.update_fp32_grads before calling fp16_optimizer_obj.step. 343 | 344 | Example:: 345 | 346 | # Ordinary operation: 347 | optimizer.backward(loss) 348 | 349 | # Naive operation with multiple losses (technically valid, but less efficient): 350 | # fp32 grads will be correct after the second call, but 351 | # the first call incurs an unnecessary fp16->fp32 grad copy. 352 | optimizer.backward(loss1) 353 | optimizer.backward(loss2) 354 | 355 | # More efficient way to handle multiple losses: 356 | # The fp16->fp32 grad copy is delayed until fp16 grads from all 357 | # losses have been accumulated. 358 | optimizer.backward(loss1, update_fp32_grads=False) 359 | optimizer.backward(loss2, update_fp32_grads=False) 360 | optimizer.update_fp32_grads() 361 | """ 362 | self.loss_scaler.backward(loss.float()) 363 | if update_fp32_grads: 364 | self.update_fp32_grads() 365 | 366 | def update_fp32_grads(self): 367 | """ 368 | Copy the .grad attribute from stored references to fp16 parameters to 369 | the .grad attribute of the master fp32 parameters that are directly 370 | updated by the optimizer. :attr:`update_fp32_grads` only needs to be called if 371 | fp16_optimizer_obj.backward was called with update_fp32_grads=False. 
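        Example (a minimal sketch of the delayed-copy pattern; ``loss1`` and ``loss2`` are
        assumed to be two losses computed in the same iteration)::

            optimizer.zero_grad()
            optimizer.backward(loss1, update_fp32_grads=False)   # fp16 grads accumulate
            optimizer.backward(loss2, update_fp32_grads=False)
            optimizer.update_fp32_grads()                        # one fp16->fp32 grad copy
            optimizer.step()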
372 | """ 373 | if self.dynamic_loss_scale: 374 | self._check_overflow() 375 | if self.overflow: return 376 | self._copy_grads_fp16_to_fp32() 377 | self._downscale_fp32() 378 | 379 | @property 380 | def loss_scale(self): 381 | return self.loss_scaler.loss_scale 382 | -------------------------------------------------------------------------------- /hparams.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | from text.symbols import symbols 3 | 4 | 5 | def create_hparams(hparams_string=None, verbose=False): 6 | """Create model hyperparameters. Parse nondefault from given string.""" 7 | 8 | hparams = tf.contrib.training.HParams( 9 | ################################ 10 | # Experiment Parameters # 11 | ################################ 12 | epochs=50000, 13 | iters_per_checkpoint=500, 14 | seed=1234, 15 | dynamic_loss_scaling=True, 16 | fp16_run=False, 17 | distributed_run=False, 18 | dist_backend="nccl", 19 | dist_url="tcp://localhost:54321", 20 | cudnn_enabled=True, 21 | cudnn_benchmark=False, 22 | ignore_layers=['speaker_embedding.weight'], 23 | 24 | ################################ 25 | # Data Parameters # 26 | ################################ 27 | training_files='filelists/ljs_audiopaths_text_sid_train_filelist.txt', 28 | validation_files='filelists/ljs_audiopaths_text_sid_val_filelist.txt', 29 | text_cleaners=['english_cleaners'], 30 | p_arpabet=1.0, 31 | cmudict_path="data/cmu_dictionary", 32 | 33 | ################################ 34 | # Audio Parameters # 35 | ################################ 36 | max_wav_value=32768.0, 37 | sampling_rate=22050, 38 | filter_length=1024, 39 | hop_length=256, 40 | win_length=1024, 41 | n_mel_channels=80, 42 | mel_fmin=0.0, 43 | mel_fmax=8000.0, 44 | f0_min=80, 45 | f0_max=880, 46 | harm_thresh=0.25, 47 | 48 | ################################ 49 | # Model Parameters # 50 | ################################ 51 | n_symbols=len(symbols), 52 | symbols_embedding_dim=512, 53 | 54 | # Encoder parameters 55 | encoder_kernel_size=5, 56 | encoder_n_convolutions=3, 57 | encoder_embedding_dim=512, 58 | 59 | # Decoder parameters 60 | n_frames_per_step=1, # currently only 1 is supported 61 | decoder_rnn_dim=1024, 62 | prenet_dim=256, 63 | prenet_f0_n_layers=1, 64 | prenet_f0_dim=1, 65 | prenet_f0_kernel_size=1, 66 | prenet_rms_dim=0, 67 | prenet_rms_kernel_size=1, 68 | max_decoder_steps=1000, 69 | gate_threshold=0.5, 70 | p_attention_dropout=0.1, 71 | p_decoder_dropout=0.1, 72 | p_teacher_forcing=1.0, 73 | 74 | # Attention parameters 75 | attention_rnn_dim=1024, 76 | attention_dim=128, 77 | 78 | # Location Layer parameters 79 | attention_location_n_filters=32, 80 | attention_location_kernel_size=31, 81 | 82 | # Mel-post processing network parameters 83 | postnet_embedding_dim=512, 84 | postnet_kernel_size=5, 85 | postnet_n_convolutions=5, 86 | 87 | # Speaker embedding 88 | n_speakers=123, 89 | speaker_embedding_dim=128, 90 | 91 | # Reference encoder 92 | with_gst=True, 93 | ref_enc_filters=[32, 32, 64, 64, 128, 128], 94 | ref_enc_size=[3, 3], 95 | ref_enc_strides=[2, 2], 96 | ref_enc_pad=[1, 1], 97 | ref_enc_gru_size=128, 98 | 99 | # Style Token Layer 100 | token_embedding_size=256, 101 | token_num=10, 102 | num_heads=8, 103 | 104 | ################################ 105 | # Optimization Hyperparameters # 106 | ################################ 107 | use_saved_learning_rate=False, 108 | learning_rate=1e-3, 109 | learning_rate_min=1e-5, 110 | learning_rate_anneal=50000, 111 | weight_decay=1e-6, 112 | 
grad_clip_thresh=1.0, 113 | batch_size=32, 114 | mask_padding=True, # set model's padded outputs to padded values 115 | 116 | ) 117 | 118 | if hparams_string: 119 | tf.compat.v1.logging.info('Parsing command line hparams: %s', hparams_string) 120 | hparams.parse(hparams_string) 121 | 122 | if verbose: 123 | tf.compat.v1.logging.info('Final parsed hparams: %s', hparams.values()) 124 | 125 | return hparams 126 | -------------------------------------------------------------------------------- /layers.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from librosa.filters import mel as librosa_mel_fn 3 | from audio_processing import dynamic_range_compression, dynamic_range_decompression 4 | from stft import STFT 5 | 6 | 7 | class LinearNorm(torch.nn.Module): 8 | def __init__(self, in_dim, out_dim, bias=True, w_init_gain='linear'): 9 | super(LinearNorm, self).__init__() 10 | self.linear_layer = torch.nn.Linear(in_dim, out_dim, bias=bias) 11 | 12 | torch.nn.init.xavier_uniform_( 13 | self.linear_layer.weight, 14 | gain=torch.nn.init.calculate_gain(w_init_gain)) 15 | 16 | def forward(self, x): 17 | return self.linear_layer(x) 18 | 19 | 20 | class ConvNorm(torch.nn.Module): 21 | def __init__(self, in_channels, out_channels, kernel_size=1, stride=1, 22 | padding=None, dilation=1, bias=True, w_init_gain='linear'): 23 | super(ConvNorm, self).__init__() 24 | if padding is None: 25 | assert(kernel_size % 2 == 1) 26 | padding = int(dilation * (kernel_size - 1) / 2) 27 | self.conv = torch.nn.Conv1d(in_channels, out_channels, 28 | kernel_size=kernel_size, stride=stride, 29 | padding=padding, dilation=dilation, 30 | bias=bias) 31 | torch.nn.init.xavier_uniform_( 32 | self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain)) 33 | 34 | def forward(self, signal): 35 | conv_signal = self.conv(signal) 36 | return conv_signal 37 | 38 | 39 | class ConvNorm2D(torch.nn.Module): 40 | def __init__(self, in_channels, out_channels, kernel_size=1, stride=1, 41 | padding=None, dilation=1, bias=True, w_init_gain='linear'): 42 | super(ConvNorm2D, self).__init__() 43 | self.conv = torch.nn.Conv2d(in_channels=in_channels, out_channels=out_channels, 44 | kernel_size=kernel_size, stride=stride, 45 | padding=padding, dilation=dilation, 46 | groups=1, bias=bias) 47 | torch.nn.init.xavier_uniform_( 48 | self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain)) 49 | 50 | def forward(self, signal): 51 | conv_signal = self.conv(signal) 52 | return conv_signal 53 | 54 | 55 | class TacotronSTFT(torch.nn.Module): 56 | def __init__(self, filter_length=1024, hop_length=256, win_length=1024, 57 | n_mel_channels=80, sampling_rate=22050, mel_fmin=0.0, 58 | mel_fmax=8000.0): 59 | super(TacotronSTFT, self).__init__() 60 | self.n_mel_channels = n_mel_channels 61 | self.sampling_rate = sampling_rate 62 | self.stft_fn = STFT(filter_length, hop_length, win_length) 63 | mel_basis = librosa_mel_fn( 64 | sampling_rate, filter_length, n_mel_channels, mel_fmin, mel_fmax) 65 | mel_basis = torch.from_numpy(mel_basis).float() 66 | self.register_buffer('mel_basis', mel_basis) 67 | 68 | def spectral_normalize(self, magnitudes): 69 | output = dynamic_range_compression(magnitudes) 70 | return output 71 | 72 | def spectral_de_normalize(self, magnitudes): 73 | output = dynamic_range_decompression(magnitudes) 74 | return output 75 | 76 | def mel_spectrogram(self, y, ref_level_db = 20, magnitude_power=1.5): 77 | """Computes mel-spectrograms from a batch of waves 78 | PARAMS 79 | ------ 80 | 
y: Variable(torch.FloatTensor) with shape (B, T) in range [-1, 1] 81 | 82 | RETURNS 83 | ------- 84 | mel_output: torch.FloatTensor of shape (B, n_mel_channels, T) 85 | """ 86 | assert(torch.min(y.data) >= -1) 87 | assert(torch.max(y.data) <= 1) 88 | 89 | magnitudes, phases = self.stft_fn.transform(y) 90 | magnitudes = magnitudes.data 91 | mel_output = torch.matmul(self.mel_basis, magnitudes) 92 | mel_output = self.spectral_normalize(mel_output) 93 | return mel_output 94 | -------------------------------------------------------------------------------- /logger.py: -------------------------------------------------------------------------------- 1 | import random 2 | import torch 3 | from tensorboardX import SummaryWriter 4 | from plotting_utils import plot_alignment_to_numpy, plot_spectrogram_to_numpy 5 | from plotting_utils import plot_gate_outputs_to_numpy 6 | 7 | 8 | class Tacotron2Logger(SummaryWriter): 9 | def __init__(self, logdir): 10 | super(Tacotron2Logger, self).__init__(logdir) 11 | 12 | def log_training(self, reduced_loss, grad_norm, learning_rate, duration, 13 | iteration): 14 | self.add_scalar("training.loss", reduced_loss, iteration) 15 | self.add_scalar("grad.norm", grad_norm, iteration) 16 | self.add_scalar("learning.rate", learning_rate, iteration) 17 | self.add_scalar("duration", duration, iteration) 18 | 19 | def log_validation(self, reduced_loss, model, y, y_pred, iteration): 20 | self.add_scalar("validation.loss", reduced_loss, iteration) 21 | _, mel_outputs, gate_outputs, alignments = y_pred 22 | mel_targets, gate_targets = y 23 | 24 | # plot distribution of parameters 25 | for tag, value in model.named_parameters(): 26 | tag = tag.replace('.', '/') 27 | self.add_histogram(tag, value.data.cpu().numpy(), iteration) 28 | 29 | # plot alignment, mel target and predicted, gate target and predicted 30 | idx = random.randint(0, alignments.size(0) - 1) 31 | self.add_image( 32 | "alignment", 33 | plot_alignment_to_numpy(alignments[idx].data.cpu().numpy().T), 34 | iteration, dataformats='HWC') 35 | self.add_image( 36 | "mel_target", 37 | plot_spectrogram_to_numpy(mel_targets[idx].data.cpu().numpy()), 38 | iteration, dataformats='HWC') 39 | self.add_image( 40 | "mel_predicted", 41 | plot_spectrogram_to_numpy(mel_outputs[idx].data.cpu().numpy()), 42 | iteration, dataformats='HWC') 43 | self.add_image( 44 | "gate", 45 | plot_gate_outputs_to_numpy( 46 | gate_targets[idx].data.cpu().numpy(), 47 | torch.sigmoid(gate_outputs[idx]).data.cpu().numpy()), 48 | iteration, dataformats='HWC') 49 | -------------------------------------------------------------------------------- /loss_function.py: -------------------------------------------------------------------------------- 1 | from torch import nn 2 | 3 | 4 | class Tacotron2Loss(nn.Module): 5 | def __init__(self): 6 | super(Tacotron2Loss, self).__init__() 7 | 8 | def forward(self, model_output, targets): 9 | mel_target, gate_target = targets[0], targets[1] 10 | mel_target.requires_grad = False 11 | gate_target.requires_grad = False 12 | gate_target = gate_target.view(-1, 1) 13 | 14 | mel_out, mel_out_postnet, gate_out, _ = model_output 15 | gate_out = gate_out.view(-1, 1) 16 | mel_loss = nn.MSELoss()(mel_out, mel_target) + \ 17 | nn.MSELoss()(mel_out_postnet, mel_target) 18 | gate_loss = nn.BCEWithLogitsLoss()(gate_out, gate_target) 19 | return mel_loss + gate_loss 20 | -------------------------------------------------------------------------------- /loss_scaler.py: 
-------------------------------------------------------------------------------- 1 | import torch 2 | 3 | class LossScaler: 4 | 5 | def __init__(self, scale=1): 6 | self.cur_scale = scale 7 | 8 | # `params` is a list / generator of torch.Variable 9 | def has_overflow(self, params): 10 | return False 11 | 12 | # `x` is a torch.Tensor 13 | def _has_inf_or_nan(x): 14 | return False 15 | 16 | # `overflow` is boolean indicating whether we overflowed in gradient 17 | def update_scale(self, overflow): 18 | pass 19 | 20 | @property 21 | def loss_scale(self): 22 | return self.cur_scale 23 | 24 | def scale_gradient(self, module, grad_in, grad_out): 25 | return tuple(self.loss_scale * g for g in grad_in) 26 | 27 | def backward(self, loss): 28 | scaled_loss = loss*self.loss_scale 29 | scaled_loss.backward() 30 | 31 | class DynamicLossScaler: 32 | 33 | def __init__(self, 34 | init_scale=2**32, 35 | scale_factor=2., 36 | scale_window=1000): 37 | self.cur_scale = init_scale 38 | self.cur_iter = 0 39 | self.last_overflow_iter = -1 40 | self.scale_factor = scale_factor 41 | self.scale_window = scale_window 42 | 43 | # `params` is a list / generator of torch.Variable 44 | def has_overflow(self, params): 45 | # return False 46 | for p in params: 47 | if p.grad is not None and DynamicLossScaler._has_inf_or_nan(p.grad.data): 48 | return True 49 | 50 | return False 51 | 52 | # `x` is a torch.Tensor 53 | def _has_inf_or_nan(x): 54 | cpu_sum = float(x.float().sum()) 55 | if cpu_sum == float('inf') or cpu_sum == -float('inf') or cpu_sum != cpu_sum: 56 | return True 57 | return False 58 | 59 | # `overflow` is boolean indicating whether we overflowed in gradient 60 | def update_scale(self, overflow): 61 | if overflow: 62 | #self.cur_scale /= self.scale_factor 63 | self.cur_scale = max(self.cur_scale/self.scale_factor, 1) 64 | self.last_overflow_iter = self.cur_iter 65 | else: 66 | if (self.cur_iter - self.last_overflow_iter) % self.scale_window == 0: 67 | self.cur_scale *= self.scale_factor 68 | # self.cur_scale = 1 69 | self.cur_iter += 1 70 | 71 | @property 72 | def loss_scale(self): 73 | return self.cur_scale 74 | 75 | def scale_gradient(self, module, grad_in, grad_out): 76 | return tuple(self.loss_scale * g for g in grad_in) 77 | 78 | def backward(self, loss): 79 | scaled_loss = loss*self.loss_scale 80 | scaled_loss.backward() 81 | 82 | ############################################################## 83 | # Example usage below here -- assuming it's in a separate file 84 | ############################################################## 85 | if __name__ == "__main__": 86 | import torch 87 | from torch.autograd import Variable 88 | from dynamic_loss_scaler import DynamicLossScaler 89 | 90 | # N is batch size; D_in is input dimension; 91 | # H is hidden dimension; D_out is output dimension. 92 | N, D_in, H, D_out = 64, 1000, 100, 10 93 | 94 | # Create random Tensors to hold inputs and outputs, and wrap them in Variables. 
95 | x = Variable(torch.randn(N, D_in), requires_grad=False) 96 | y = Variable(torch.randn(N, D_out), requires_grad=False) 97 | 98 | w1 = Variable(torch.randn(D_in, H), requires_grad=True) 99 | w2 = Variable(torch.randn(H, D_out), requires_grad=True) 100 | parameters = [w1, w2] 101 | 102 | learning_rate = 1e-6 103 | optimizer = torch.optim.SGD(parameters, lr=learning_rate) 104 | loss_scaler = DynamicLossScaler() 105 | 106 | for t in range(500): 107 | y_pred = x.mm(w1).clamp(min=0).mm(w2) 108 | loss = (y_pred - y).pow(2).sum() * loss_scaler.loss_scale 109 | print('Iter {} loss scale: {}'.format(t, loss_scaler.loss_scale)) 110 | print('Iter {} scaled loss: {}'.format(t, loss.data[0])) 111 | print('Iter {} unscaled loss: {}'.format(t, loss.data[0] / loss_scaler.loss_scale)) 112 | 113 | # Run backprop 114 | optimizer.zero_grad() 115 | loss.backward() 116 | 117 | # Check for overflow 118 | has_overflow = DynamicLossScaler.has_overflow(parameters) 119 | 120 | # If no overflow, unscale grad and update as usual 121 | if not has_overflow: 122 | for param in parameters: 123 | param.grad.data.mul_(1. / loss_scaler.loss_scale) 124 | optimizer.step() 125 | # Otherwise, don't do anything -- ie, skip iteration 126 | else: 127 | print('OVERFLOW!') 128 | 129 | # Update loss scale for next iteration 130 | loss_scaler.update_scale(has_overflow) 131 | 132 | -------------------------------------------------------------------------------- /mellotron_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NVIDIA/mellotron/d5362ccae23984f323e3cb024a01ec1de0493aff/mellotron_logo.png -------------------------------------------------------------------------------- /mellotron_utils.py: -------------------------------------------------------------------------------- 1 | import re 2 | import numpy as np 3 | import music21 as m21 4 | import torch 5 | import torch.nn.functional as F 6 | from text import text_to_sequence, get_arpabet, cmudict 7 | 8 | 9 | CMUDICT_PATH = "data/cmu_dictionary" 10 | CMUDICT = cmudict.CMUDict(CMUDICT_PATH) 11 | PHONEME2GRAPHEME = { 12 | 'AA': ['a', 'o', 'ah'], 13 | 'AE': ['a', 'e'], 14 | 'AH': ['u', 'e', 'a', 'h', 'o'], 15 | 'AO': ['o', 'u', 'au'], 16 | 'AW': ['ou', 'ow'], 17 | 'AX': ['a'], 18 | 'AXR': ['er'], 19 | 'AY': ['i'], 20 | 'EH': ['e', 'ae'], 21 | 'EY': ['a', 'ai', 'ei', 'e', 'y'], 22 | 'IH': ['i', 'e', 'y'], 23 | 'IX': ['e', 'i'], 24 | 'IY': ['ea', 'ey', 'y', 'i'], 25 | 'OW': ['oa', 'o'], 26 | 'OY': ['oy'], 27 | 'UH': ['oo'], 28 | 'UW': ['oo', 'u', 'o'], 29 | 'UX': ['u'], 30 | 'B': ['b'], 31 | 'CH': ['ch', 'tch'], 32 | 'D': ['d', 'e', 'de'], 33 | 'DH': ['th'], 34 | 'DX': ['tt'], 35 | 'EL': ['le'], 36 | 'EM': ['m'], 37 | 'EN': ['on'], 38 | 'ER': ['i', 'er'], 39 | 'F': ['f'], 40 | 'G': ['g'], 41 | 'HH': ['h'], 42 | 'JH': ['j'], 43 | 'K': ['k', 'c', 'ch'], 44 | 'KS': ['x'], 45 | 'L': ['ll', 'l'], 46 | 'M': ['m'], 47 | 'N': ['n', 'gn'], 48 | 'NG': ['ng'], 49 | 'NX': ['nn'], 50 | 'P': ['p'], 51 | 'Q': ['-'], 52 | 'R': ['wr', 'r'], 53 | 'S': ['s', 'ce'], 54 | 'SH': ['sh'], 55 | 'T': ['t'], 56 | 'TH': ['th'], 57 | 'V': ['v', 'f', 'e'], 58 | 'W': ['w'], 59 | 'WH': ['wh'], 60 | 'Y': ['y', 'j'], 61 | 'Z': ['z', 's'], 62 | 'ZH': ['s'] 63 | } 64 | 65 | ######################## 66 | # CONSONANT DURATION # 67 | ######################## 68 | PHONEMEDURATION = { 69 | 'B': 0.05, 70 | 'CH': 0.1, 71 | 'D': 0.075, 72 | 'DH': 0.05, 73 | 'DX': 0.05, 74 | 'EL': 0.05, 75 | 'EM': 0.05, 76 | 'EN': 0.05, 77 | 'F': 0.1, 78 | 'G': 0.05, 79 | 
'HH': 0.05, 80 | 'JH': 0.05, 81 | 'K': 0.05, 82 | 'L': 0.05, 83 | 'M': 0.15, 84 | 'N': 0.15, 85 | 'NG': 0.15, 86 | 'NX': 0.05, 87 | 'P': 0.05, 88 | 'Q': 0.075, 89 | 'R': 0.05, 90 | 'S': 0.1, 91 | 'SH': 0.05, 92 | 'T': 0.075, 93 | 'TH': 0.1, 94 | 'V': 0.05, 95 | 'Y': 0.05, 96 | 'W': 0.05, 97 | 'WH': 0.05, 98 | 'Z': 0.05, 99 | 'ZH': 0.05 100 | } 101 | 102 | 103 | def add_space_between_events(events, connect=False): 104 | new_events = [] 105 | for i in range(1, len(events)): 106 | token_a, freq_a, start_time_a, end_time_a = events[i-1][-1] 107 | token_b, freq_b, start_time_b, end_time_b = events[i][0] 108 | 109 | if token_a in (' ', '') and len(events[i-1]) == 1: 110 | new_events.append(events[i-1]) 111 | elif token_a not in (' ', '') and token_b not in (' ', ''): 112 | new_events.append(events[i-1]) 113 | if connect: 114 | new_events.append([[' ', 0, end_time_a, start_time_b]]) 115 | else: 116 | new_events.append([[' ', 0, end_time_a, end_time_a]]) 117 | else: 118 | new_events.append(events[i-1]) 119 | 120 | if new_events[-1][0][0] != ' ': 121 | new_events.append([[' ', 0, end_time_a, end_time_a]]) 122 | new_events.append(events[-1]) 123 | 124 | return new_events 125 | 126 | 127 | def adjust_words(events): 128 | new_events = [] 129 | for event in events: 130 | if len(event) == 1 and event[0][0] == ' ': 131 | new_events.append(event) 132 | else: 133 | for e in event: 134 | if e[0][0].isupper(): 135 | new_events.append([e]) 136 | else: 137 | new_events[-1].extend([e]) 138 | return new_events 139 | 140 | 141 | def adjust_extensions(events, phoneme_durations): 142 | if len(events) == 1: 143 | return events 144 | 145 | idx_last_vowel = None 146 | n_consonants_after_last_vowel = 0 147 | target_ids = np.arange(len(events)) 148 | for i in range(len(events)): 149 | token = re.sub('[0-9{}]', '', events[i][0]) 150 | if idx_last_vowel is None and token not in phoneme_durations: 151 | idx_last_vowel = i 152 | n_consonants_after_last_vowel = 0 153 | else: 154 | if token == '_' and not n_consonants_after_last_vowel: 155 | events[i][0] = events[idx_last_vowel][0] 156 | elif token == '_' and n_consonants_after_last_vowel: 157 | events[i][0] = events[idx_last_vowel][0] 158 | start = idx_last_vowel + 1 159 | target_ids[start:start+n_consonants_after_last_vowel] += 1 160 | target_ids[i] -= n_consonants_after_last_vowel 161 | elif token in phoneme_durations: 162 | n_consonants_after_last_vowel += 1 163 | else: 164 | n_consonants_after_last_vowel = 0 165 | idx_last_vowel = i 166 | 167 | new_events = [0] * len(events) 168 | for i in range(len(events)): 169 | new_events[target_ids[i]] = events[i] 170 | 171 | # adjust time of consonants that were repositioned 172 | for i in range(1, len(new_events)): 173 | if new_events[i][2] < new_events[i-1][2]: 174 | new_events[i][2] = new_events[i-1][2] 175 | new_events[i][3] = new_events[i-1][3] 176 | 177 | return new_events 178 | 179 | 180 | def adjust_consonant_lengths(events, phoneme_durations): 181 | t_init = events[0][2] 182 | 183 | idx_last_vowel = None 184 | for i in range(len(events)): 185 | task = re.sub('[0-9{}]', '', events[i][0]) 186 | if task in phoneme_durations: 187 | duration = phoneme_durations[task] 188 | if idx_last_vowel is None: # consonant comes before any vowel 189 | events[i][2] = t_init 190 | events[i][3] = t_init + duration 191 | else: # consonant comes after a vowel, must offset 192 | events[idx_last_vowel][3] -= duration 193 | for k in range(idx_last_vowel+1, i): 194 | events[k][2] -= duration 195 | events[k][3] -= duration 196 | events[i][2] = 
events[i-1][3] 197 | events[i][3] = events[i-1][3] + duration 198 | else: 199 | events[i][2] = t_init 200 | events[i][3] = events[i][3] 201 | t_init = events[i][3] 202 | idx_last_vowel = i 203 | t_init = events[i][3] 204 | 205 | return events 206 | 207 | 208 | def adjust_consonants(events, phoneme_durations): 209 | if len(events) == 1: 210 | return events 211 | 212 | start = 0 213 | split_ids = [] 214 | t_init = events[0][2] 215 | 216 | # get each substring group 217 | for i in range(1, len(events)): 218 | if events[i][2] != t_init: 219 | split_ids.append((start, i)) 220 | start = i 221 | t_init = events[i][2] 222 | split_ids.append((start, len(events))) 223 | 224 | for (start, end) in split_ids: 225 | events[start:end] = adjust_consonant_lengths( 226 | events[start:end], phoneme_durations) 227 | 228 | return events 229 | 230 | 231 | def adjust_event(event, hop_length=256, sampling_rate=22050): 232 | tokens, freq, start_time, end_time = event 233 | 234 | if tokens == ' ': 235 | return [event] if freq == 0 else [['_', freq, start_time, end_time]] 236 | 237 | return [[token, freq, start_time, end_time] for token in tokens] 238 | 239 | 240 | def musicxml2score(filepath, bpm=60): 241 | track = {} 242 | beat_length_seconds = 60/bpm 243 | data = m21.converter.parse(filepath) 244 | for i in range(len(data.parts)): 245 | part = data.parts[i].flat 246 | events = [] 247 | for k in range(len(part.notesAndRests)): 248 | event = part.notesAndRests[k] 249 | if isinstance(event, m21.note.Note): 250 | freq = event.pitch.frequency 251 | token = event.lyrics[0].text if len(event.lyrics) > 0 else ' ' 252 | start_time = event.offset * beat_length_seconds 253 | end_time = start_time + event.duration.quarterLength * beat_length_seconds 254 | event = [token, freq, start_time, end_time] 255 | elif isinstance(event, m21.note.Rest): 256 | freq = 0 257 | token = ' ' 258 | start_time = event.offset * beat_length_seconds 259 | end_time = start_time + event.duration.quarterLength * beat_length_seconds 260 | event = [token, freq, start_time, end_time] 261 | 262 | if token == '_': 263 | raise Exception("Unexpected token {}".format(token)) 264 | 265 | if len(events) == 0: 266 | events.append(event) 267 | else: 268 | if token == ' ': 269 | if freq == 0: 270 | if events[-1][1] == 0: 271 | events[-1][3] = end_time 272 | else: 273 | events.append(event) 274 | elif freq == events[-1][1]: # is event duration extension ? 
275 | events[-1][-1] = end_time 276 | else: # must be different note on same syllable 277 | events.append(event) 278 | else: 279 | events.append(event) 280 | track[part.partName] = events 281 | return track 282 | 283 | 284 | def track2events(track): 285 | events = [] 286 | for e in track: 287 | events.extend(adjust_event(e)) 288 | group_ids = [i for i in range(len(events)) 289 | if events[i][0] in [' '] or events[i][0].isupper()] 290 | 291 | events_grouped = [] 292 | for i in range(1, len(group_ids)): 293 | start, end = group_ids[i-1], group_ids[i] 294 | events_grouped.append(events[start:end]) 295 | 296 | if events[-1][0] != ' ': 297 | events_grouped.append(events[group_ids[-1]:]) 298 | 299 | return events_grouped 300 | 301 | 302 | def events2eventsarpabet(event): 303 | if event[0][0] == ' ': 304 | return event 305 | 306 | # get word and word arpabet 307 | word = ''.join([e[0] for e in event if e[0] not in('_', ' ')]) 308 | word_arpabet = get_arpabet(word, CMUDICT) 309 | if word_arpabet[0] != '{': 310 | return event 311 | 312 | word_arpabet = word_arpabet.split() 313 | 314 | # align tokens to arpabet 315 | i, k = 0, 0 316 | new_events = [] 317 | while i < len(event) and k < len(word_arpabet): 318 | # single token 319 | token_a, freq_a, start_time_a, end_time_a = event[i] 320 | 321 | if token_a == ' ': 322 | new_events.append([token_a, freq_a, start_time_a, end_time_a]) 323 | i += 1 324 | continue 325 | 326 | if token_a == '_': 327 | new_events.append([token_a, freq_a, start_time_a, end_time_a]) 328 | i += 1 329 | continue 330 | 331 | # two tokens 332 | if i < len(event) - 1: 333 | j = i + 1 334 | token_b, freq_b, start_time_b, end_time_b = event[j] 335 | between_events = [] 336 | while j < len(event) and event[j][0] == '_': 337 | between_events.append([token_b, freq_b, start_time_b, end_time_b]) 338 | j += 1 339 | if j < len(event): 340 | token_b, freq_b, start_time_b, end_time_b = event[j] 341 | 342 | token_compound_2 = (token_a + token_b).lower() 343 | 344 | # single arpabet 345 | arpabet = re.sub('[0-9{}]', '', word_arpabet[k]) 346 | 347 | if k < len(word_arpabet) - 1: 348 | arpabet_compound_2 = ''.join(word_arpabet[k:k+2]) 349 | arpabet_compound_2 = re.sub('[0-9{}]', '', arpabet_compound_2) 350 | 351 | if i < len(event) - 1 and token_compound_2 in PHONEME2GRAPHEME[arpabet]: 352 | new_events.append([word_arpabet[k], freq_a, start_time_a, end_time_a]) 353 | if len(between_events): 354 | new_events.extend(between_events) 355 | if start_time_a != start_time_b: 356 | new_events.append([word_arpabet[k], freq_b, start_time_b, end_time_b]) 357 | i += 2 + len(between_events) 358 | k += 1 359 | elif token_a.lower() in PHONEME2GRAPHEME[arpabet]: 360 | new_events.append([word_arpabet[k], freq_a, start_time_a, end_time_a]) 361 | i += 1 362 | k += 1 363 | elif arpabet_compound_2 in PHONEME2GRAPHEME and token_a.lower() in PHONEME2GRAPHEME[arpabet_compound_2]: 364 | new_events.append([word_arpabet[k], freq_a, start_time_a, end_time_a]) 365 | new_events.append([word_arpabet[k+1], freq_a, start_time_a, end_time_a]) 366 | i += 1 367 | k += 2 368 | else: 369 | k += 1 370 | 371 | # add extensions and pauses at end of words 372 | while i < len(event): 373 | token_a, freq_a, start_time_a, end_time_a = event[i] 374 | 375 | if token_a in (' ', '_'): 376 | new_events.append([token_a, freq_a, start_time_a, end_time_a]) 377 | i += 1 378 | 379 | return new_events 380 | 381 | 382 | def event2alignment(events, hop_length=256, sampling_rate=22050): 383 | frame_length = float(hop_length) / float(sampling_rate) 384 | 
385 | n_frames = int(events[-1][-1][-1] / frame_length) 386 | n_tokens = np.sum([len(e) for e in events]) 387 | alignment = np.zeros((n_tokens, n_frames)) 388 | 389 | cur_event = -1 390 | for event in events: 391 | for i in range(len(event)): 392 | if len(event) == 1 or cur_event == -1 or event[i][0] != event[i-1][0]: 393 | cur_event += 1 394 | token, freq, start_time, end_time = event[i] 395 | alignment[cur_event, int(start_time/frame_length):int(end_time/frame_length)] = 1 396 | 397 | return alignment[:cur_event+1] 398 | 399 | 400 | def event2f0(events, hop_length=256, sampling_rate=22050): 401 | frame_length = float(hop_length) / float(sampling_rate) 402 | n_frames = int(events[-1][-1][-1] / frame_length) 403 | f0s = np.zeros((1, n_frames)) 404 | 405 | for event in events: 406 | for i in range(len(event)): 407 | token, freq, start_time, end_time = event[i] 408 | f0s[0, int(start_time/frame_length):int(end_time/frame_length)] = freq 409 | 410 | return f0s 411 | 412 | 413 | def event2text(events, convert_stress, cmudict=None): 414 | text_clean = '' 415 | for event in events: 416 | for i in range(len(event)): 417 | if i > 0 and event[i][0] == event[i-1][0]: 418 | continue 419 | if event[i][0] == ' ' and len(event) > 1: 420 | if text_clean[-1] != "}": 421 | text_clean = text_clean[:-1] + '} {' 422 | else: 423 | text_clean += ' {' 424 | else: 425 | if event[i][0][-1] in ('}', ' '): 426 | text_clean += event[i][0] 427 | else: 428 | text_clean += event[i][0] + ' ' 429 | 430 | if convert_stress: 431 | text_clean = re.sub('[0-9]', '1', text_clean) 432 | 433 | text_encoded = text_to_sequence(text_clean, [], cmudict) 434 | return text_encoded, text_clean 435 | 436 | 437 | def remove_excess_frames(alignment, f0s): 438 | excess_frames = np.sum(alignment.sum(0) == 0) 439 | alignment = alignment[:, :-excess_frames] if excess_frames > 0 else alignment 440 | f0s = f0s[:, :-excess_frames] if excess_frames > 0 else f0s 441 | return alignment, f0s 442 | 443 | 444 | def get_data_from_musicxml(filepath, bpm, phoneme_durations=None, 445 | convert_stress=False): 446 | if phoneme_durations is None: 447 | phoneme_durations = PHONEMEDURATION 448 | score = musicxml2score(filepath, bpm) 449 | data = {} 450 | for k, v in score.items(): 451 | # ignore empty tracks 452 | if len(v) == 1 and v[0][0] == ' ': 453 | continue 454 | 455 | events = track2events(v) 456 | events = adjust_words(events) 457 | events_arpabet = [events2eventsarpabet(e) for e in events] 458 | 459 | # make adjustments 460 | events_arpabet = [adjust_extensions(e, phoneme_durations) 461 | for e in events_arpabet] 462 | events_arpabet = [adjust_consonants(e, phoneme_durations) 463 | for e in events_arpabet] 464 | events_arpabet = add_space_between_events(events_arpabet) 465 | 466 | # convert data to alignment, f0 and text encoded 467 | alignment = event2alignment(events_arpabet) 468 | f0s = event2f0(events_arpabet) 469 | alignment, f0s = remove_excess_frames(alignment, f0s) 470 | text_encoded, text_clean = event2text(events_arpabet, convert_stress) 471 | 472 | # convert data to torch 473 | alignment = torch.from_numpy(alignment).permute(1, 0)[:, None].float() 474 | f0s = torch.from_numpy(f0s)[None].float() 475 | text_encoded = torch.LongTensor(text_encoded)[None] 476 | data[k] = {'rhythm': alignment, 477 | 'pitch_contour': f0s, 478 | 'text_encoded': text_encoded} 479 | 480 | return data 481 | 482 | 483 | if __name__ == "__main__": 484 | import argparse 485 | # Get defaults so it can work with no Sacred 486 | parser = argparse.ArgumentParser() 487 | 
parser.add_argument('-f', "--filepath", required=True) 488 | args = parser.parse_args() 489 | get_data_from_musicxml(args.filepath, 60) 490 | -------------------------------------------------------------------------------- /model.py: -------------------------------------------------------------------------------- 1 | from math import sqrt 2 | import numpy as np 3 | from numpy import finfo 4 | import torch 5 | from torch.autograd import Variable 6 | from torch import nn 7 | from torch.nn import functional as F 8 | from layers import ConvNorm, LinearNorm 9 | from utils import to_gpu, get_mask_from_lengths 10 | from modules import GST 11 | 12 | drop_rate = 0.5 13 | 14 | def load_model(hparams): 15 | model = Tacotron2(hparams).cuda() 16 | if hparams.fp16_run: 17 | model.decoder.attention_layer.score_mask_value = finfo('float16').min 18 | 19 | return model 20 | 21 | 22 | class LocationLayer(nn.Module): 23 | def __init__(self, attention_n_filters, attention_kernel_size, 24 | attention_dim): 25 | super(LocationLayer, self).__init__() 26 | padding = int((attention_kernel_size - 1) / 2) 27 | self.location_conv = ConvNorm(2, attention_n_filters, 28 | kernel_size=attention_kernel_size, 29 | padding=padding, bias=False, stride=1, 30 | dilation=1) 31 | self.location_dense = LinearNorm(attention_n_filters, attention_dim, 32 | bias=False, w_init_gain='tanh') 33 | 34 | def forward(self, attention_weights_cat): 35 | processed_attention = self.location_conv(attention_weights_cat) 36 | processed_attention = processed_attention.transpose(1, 2) 37 | processed_attention = self.location_dense(processed_attention) 38 | return processed_attention 39 | 40 | 41 | class Attention(nn.Module): 42 | def __init__(self, attention_rnn_dim, embedding_dim, attention_dim, 43 | attention_location_n_filters, attention_location_kernel_size): 44 | super(Attention, self).__init__() 45 | self.query_layer = LinearNorm(attention_rnn_dim, attention_dim, 46 | bias=False, w_init_gain='tanh') 47 | self.memory_layer = LinearNorm(embedding_dim, attention_dim, bias=False, 48 | w_init_gain='tanh') 49 | self.v = LinearNorm(attention_dim, 1, bias=False) 50 | self.location_layer = LocationLayer(attention_location_n_filters, 51 | attention_location_kernel_size, 52 | attention_dim) 53 | self.score_mask_value = -float("inf") 54 | 55 | def get_alignment_energies(self, query, processed_memory, 56 | attention_weights_cat): 57 | """ 58 | PARAMS 59 | ------ 60 | query: decoder output (batch, n_mel_channels * n_frames_per_step) 61 | processed_memory: processed encoder outputs (B, T_in, attention_dim) 62 | attention_weights_cat: cumulative and prev. 
att weights (B, 2, max_time) 63 | 64 | RETURNS 65 | ------- 66 | alignment (batch, max_time) 67 | """ 68 | 69 | processed_query = self.query_layer(query.unsqueeze(1)) 70 | processed_attention_weights = self.location_layer(attention_weights_cat) 71 | energies = self.v(torch.tanh( 72 | processed_query + processed_attention_weights + processed_memory)) 73 | 74 | energies = energies.squeeze(-1) 75 | return energies 76 | 77 | def forward(self, attention_hidden_state, memory, processed_memory, 78 | attention_weights_cat, mask, attention_weights=None): 79 | """ 80 | PARAMS 81 | ------ 82 | attention_hidden_state: attention rnn last output 83 | memory: encoder outputs 84 | processed_memory: processed encoder outputs 85 | attention_weights_cat: previous and cummulative attention weights 86 | mask: binary mask for padded data 87 | """ 88 | if attention_weights is None: 89 | alignment = self.get_alignment_energies( 90 | attention_hidden_state, processed_memory, attention_weights_cat) 91 | 92 | if mask is not None: 93 | alignment.data.masked_fill_(mask, self.score_mask_value) 94 | 95 | attention_weights = F.softmax(alignment, dim=1) 96 | attention_context = torch.bmm(attention_weights.unsqueeze(1), memory) 97 | attention_context = attention_context.squeeze(1) 98 | 99 | return attention_context, attention_weights 100 | 101 | 102 | class Prenet(nn.Module): 103 | def __init__(self, in_dim, sizes): 104 | super(Prenet, self).__init__() 105 | in_sizes = [in_dim] + sizes[:-1] 106 | self.layers = nn.ModuleList( 107 | [LinearNorm(in_size, out_size, bias=False) 108 | for (in_size, out_size) in zip(in_sizes, sizes)]) 109 | 110 | def forward(self, x): 111 | for linear in self.layers: 112 | x = F.dropout(F.relu(linear(x)), p=drop_rate, training=True) 113 | return x 114 | 115 | 116 | class Postnet(nn.Module): 117 | """Postnet 118 | - Five 1-d convolution with 512 channels and kernel size 5 119 | """ 120 | 121 | def __init__(self, hparams): 122 | super(Postnet, self).__init__() 123 | self.convolutions = nn.ModuleList() 124 | 125 | self.convolutions.append( 126 | nn.Sequential( 127 | ConvNorm(hparams.n_mel_channels, hparams.postnet_embedding_dim, 128 | kernel_size=hparams.postnet_kernel_size, stride=1, 129 | padding=int((hparams.postnet_kernel_size - 1) / 2), 130 | dilation=1, w_init_gain='tanh'), 131 | nn.BatchNorm1d(hparams.postnet_embedding_dim)) 132 | ) 133 | 134 | for i in range(1, hparams.postnet_n_convolutions - 1): 135 | self.convolutions.append( 136 | nn.Sequential( 137 | ConvNorm(hparams.postnet_embedding_dim, 138 | hparams.postnet_embedding_dim, 139 | kernel_size=hparams.postnet_kernel_size, stride=1, 140 | padding=int((hparams.postnet_kernel_size - 1) / 2), 141 | dilation=1, w_init_gain='tanh'), 142 | nn.BatchNorm1d(hparams.postnet_embedding_dim)) 143 | ) 144 | 145 | self.convolutions.append( 146 | nn.Sequential( 147 | ConvNorm(hparams.postnet_embedding_dim, hparams.n_mel_channels, 148 | kernel_size=hparams.postnet_kernel_size, stride=1, 149 | padding=int((hparams.postnet_kernel_size - 1) / 2), 150 | dilation=1, w_init_gain='linear'), 151 | nn.BatchNorm1d(hparams.n_mel_channels)) 152 | ) 153 | 154 | def forward(self, x): 155 | for i in range(len(self.convolutions) - 1): 156 | x = F.dropout(torch.tanh(self.convolutions[i](x)), drop_rate, self.training) 157 | x = F.dropout(self.convolutions[-1](x), drop_rate, self.training) 158 | 159 | return x 160 | 161 | 162 | class Encoder(nn.Module): 163 | """Encoder module: 164 | - Three 1-d convolution banks 165 | - Bidirectional LSTM 166 | """ 167 | def 
__init__(self, hparams): 168 | super(Encoder, self).__init__() 169 | 170 | convolutions = [] 171 | for _ in range(hparams.encoder_n_convolutions): 172 | conv_layer = nn.Sequential( 173 | ConvNorm(hparams.encoder_embedding_dim, 174 | hparams.encoder_embedding_dim, 175 | kernel_size=hparams.encoder_kernel_size, stride=1, 176 | padding=int((hparams.encoder_kernel_size - 1) / 2), 177 | dilation=1, w_init_gain='relu'), 178 | nn.BatchNorm1d(hparams.encoder_embedding_dim)) 179 | convolutions.append(conv_layer) 180 | self.convolutions = nn.ModuleList(convolutions) 181 | 182 | self.lstm = nn.LSTM(hparams.encoder_embedding_dim, 183 | int(hparams.encoder_embedding_dim / 2), 1, 184 | batch_first=True, bidirectional=True) 185 | 186 | def forward(self, x, input_lengths): 187 | if x.size()[0] > 1: 188 | print("here") 189 | x_embedded = [] 190 | for b_ind in range(x.size()[0]): # TODO: Speed up 191 | curr_x = x[b_ind:b_ind+1, :, :input_lengths[b_ind]].clone() 192 | for conv in self.convolutions: 193 | curr_x = F.dropout(F.relu(conv(curr_x)), drop_rate, self.training) 194 | x_embedded.append(curr_x[0].transpose(0, 1)) 195 | x = torch.nn.utils.rnn.pad_sequence(x_embedded, batch_first=True) 196 | else: 197 | for conv in self.convolutions: 198 | x = F.dropout(F.relu(conv(x)), drop_rate, self.training) 199 | x = x.transpose(1, 2) 200 | 201 | # pytorch tensor are not reversible, hence the conversion 202 | input_lengths = input_lengths.cpu().numpy() 203 | x = nn.utils.rnn.pack_padded_sequence( 204 | x, input_lengths, batch_first=True) 205 | 206 | self.lstm.flatten_parameters() 207 | outputs, _ = self.lstm(x) 208 | 209 | outputs, _ = nn.utils.rnn.pad_packed_sequence( 210 | outputs, batch_first=True) 211 | 212 | return outputs 213 | 214 | def inference(self, x): 215 | for conv in self.convolutions: 216 | x = F.dropout(F.relu(conv(x)), drop_rate, self.training) 217 | 218 | x = x.transpose(1, 2) 219 | 220 | self.lstm.flatten_parameters() 221 | outputs, _ = self.lstm(x) 222 | 223 | return outputs 224 | 225 | 226 | class Decoder(nn.Module): 227 | def __init__(self, hparams): 228 | super(Decoder, self).__init__() 229 | self.n_mel_channels = hparams.n_mel_channels 230 | self.n_frames_per_step = hparams.n_frames_per_step 231 | self.encoder_embedding_dim = hparams.encoder_embedding_dim + hparams.token_embedding_size + hparams.speaker_embedding_dim 232 | self.attention_rnn_dim = hparams.attention_rnn_dim 233 | self.decoder_rnn_dim = hparams.decoder_rnn_dim 234 | self.prenet_dim = hparams.prenet_dim 235 | self.max_decoder_steps = hparams.max_decoder_steps 236 | self.gate_threshold = hparams.gate_threshold 237 | self.p_attention_dropout = hparams.p_attention_dropout 238 | self.p_decoder_dropout = hparams.p_decoder_dropout 239 | self.p_teacher_forcing = hparams.p_teacher_forcing 240 | 241 | self.prenet_f0 = ConvNorm( 242 | 1, hparams.prenet_f0_dim, 243 | kernel_size=hparams.prenet_f0_kernel_size, 244 | padding=max(0, int(hparams.prenet_f0_kernel_size/2)), 245 | bias=False, stride=1, dilation=1) 246 | 247 | self.prenet = Prenet( 248 | hparams.n_mel_channels * hparams.n_frames_per_step, 249 | [hparams.prenet_dim, hparams.prenet_dim]) 250 | 251 | self.attention_rnn = nn.LSTMCell( 252 | hparams.prenet_dim + hparams.prenet_f0_dim + self.encoder_embedding_dim, 253 | hparams.attention_rnn_dim) 254 | 255 | self.attention_layer = Attention( 256 | hparams.attention_rnn_dim, self.encoder_embedding_dim, 257 | hparams.attention_dim, hparams.attention_location_n_filters, 258 | hparams.attention_location_kernel_size) 259 | 260 | 
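        # torch.nn.LSTMCell's positional arguments are (input_size, hidden_size, bias);
        # it has no num_layers parameter, so the trailing 1 in the call below is taken as
        # the bias flag (truthy, i.e. bias weights are created), not as a layer count.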
self.decoder_rnn = nn.LSTMCell( 261 | hparams.attention_rnn_dim + self.encoder_embedding_dim, 262 | hparams.decoder_rnn_dim, 1) 263 | 264 | self.linear_projection = LinearNorm( 265 | hparams.decoder_rnn_dim + self.encoder_embedding_dim, 266 | hparams.n_mel_channels * hparams.n_frames_per_step) 267 | 268 | self.gate_layer = LinearNorm( 269 | hparams.decoder_rnn_dim + self.encoder_embedding_dim, 1, 270 | bias=True, w_init_gain='sigmoid') 271 | 272 | def get_go_frame(self, memory): 273 | """ Gets all zeros frames to use as first decoder input 274 | PARAMS 275 | ------ 276 | memory: decoder outputs 277 | 278 | RETURNS 279 | ------- 280 | decoder_input: all zeros frames 281 | """ 282 | B = memory.size(0) 283 | decoder_input = Variable(memory.data.new( 284 | B, self.n_mel_channels * self.n_frames_per_step).zero_()) 285 | return decoder_input 286 | 287 | def get_end_f0(self, f0s): 288 | B = f0s.size(0) 289 | dummy = Variable(f0s.data.new(B, 1, f0s.size(1)).zero_()) 290 | return dummy 291 | 292 | def initialize_decoder_states(self, memory, mask): 293 | """ Initializes attention rnn states, decoder rnn states, attention 294 | weights, attention cumulative weights, attention context, stores memory 295 | and stores processed memory 296 | PARAMS 297 | ------ 298 | memory: Encoder outputs 299 | mask: Mask for padded data if training, expects None for inference 300 | """ 301 | B = memory.size(0) 302 | MAX_TIME = memory.size(1) 303 | 304 | self.attention_hidden = Variable(memory.data.new( 305 | B, self.attention_rnn_dim).zero_()) 306 | self.attention_cell = Variable(memory.data.new( 307 | B, self.attention_rnn_dim).zero_()) 308 | 309 | self.decoder_hidden = Variable(memory.data.new( 310 | B, self.decoder_rnn_dim).zero_()) 311 | self.decoder_cell = Variable(memory.data.new( 312 | B, self.decoder_rnn_dim).zero_()) 313 | 314 | self.attention_weights = Variable(memory.data.new( 315 | B, MAX_TIME).zero_()) 316 | self.attention_weights_cum = Variable(memory.data.new( 317 | B, MAX_TIME).zero_()) 318 | self.attention_context = Variable(memory.data.new( 319 | B, self.encoder_embedding_dim).zero_()) 320 | 321 | self.memory = memory 322 | self.processed_memory = self.attention_layer.memory_layer(memory) 323 | self.mask = mask 324 | 325 | def parse_decoder_inputs(self, decoder_inputs): 326 | """ Prepares decoder inputs, i.e. mel outputs 327 | PARAMS 328 | ------ 329 | decoder_inputs: inputs used for teacher-forced training, i.e. 
mel-specs 330 | 331 | RETURNS 332 | ------- 333 | inputs: processed decoder inputs 334 | 335 | """ 336 | # (B, n_mel_channels, T_out) -> (B, T_out, n_mel_channels) 337 | decoder_inputs = decoder_inputs.transpose(1, 2) 338 | decoder_inputs = decoder_inputs.view( 339 | decoder_inputs.size(0), 340 | int(decoder_inputs.size(1)/self.n_frames_per_step), -1) 341 | # (B, T_out, n_mel_channels) -> (T_out, B, n_mel_channels) 342 | decoder_inputs = decoder_inputs.transpose(0, 1) 343 | return decoder_inputs 344 | 345 | def parse_decoder_outputs(self, mel_outputs, gate_outputs, alignments): 346 | """ Prepares decoder outputs for output 347 | PARAMS 348 | ------ 349 | mel_outputs: 350 | gate_outputs: gate output energies 351 | alignments: 352 | 353 | RETURNS 354 | ------- 355 | mel_outputs: 356 | gate_outpust: gate output energies 357 | alignments: 358 | """ 359 | # (T_out, B) -> (B, T_out) 360 | alignments = torch.stack(alignments).transpose(0, 1) 361 | # (T_out, B) -> (B, T_out) 362 | gate_outputs = torch.stack(gate_outputs) 363 | if len(gate_outputs.size()) > 1: 364 | gate_outputs = gate_outputs.transpose(0, 1) 365 | else: 366 | gate_outputs = gate_outputs[None] 367 | gate_outputs = gate_outputs.contiguous() 368 | # (T_out, B, n_mel_channels) -> (B, T_out, n_mel_channels) 369 | mel_outputs = torch.stack(mel_outputs).transpose(0, 1).contiguous() 370 | # decouple frames per step 371 | mel_outputs = mel_outputs.view( 372 | mel_outputs.size(0), -1, self.n_mel_channels) 373 | # (B, T_out, n_mel_channels) -> (B, n_mel_channels, T_out) 374 | mel_outputs = mel_outputs.transpose(1, 2) 375 | 376 | return mel_outputs, gate_outputs, alignments 377 | 378 | def decode(self, decoder_input, attention_weights=None): 379 | """ Decoder step using stored states, attention and memory 380 | PARAMS 381 | ------ 382 | decoder_input: previous mel output 383 | 384 | RETURNS 385 | ------- 386 | mel_output: 387 | gate_output: gate output energies 388 | attention_weights: 389 | """ 390 | cell_input = torch.cat((decoder_input, self.attention_context), -1) 391 | self.attention_hidden, self.attention_cell = self.attention_rnn( 392 | cell_input, (self.attention_hidden, self.attention_cell)) 393 | self.attention_hidden = F.dropout( 394 | self.attention_hidden, self.p_attention_dropout, self.training) 395 | self.attention_cell = F.dropout( 396 | self.attention_cell, self.p_attention_dropout, self.training) 397 | 398 | attention_weights_cat = torch.cat( 399 | (self.attention_weights.unsqueeze(1), 400 | self.attention_weights_cum.unsqueeze(1)), dim=1) 401 | self.attention_context, self.attention_weights = self.attention_layer( 402 | self.attention_hidden, self.memory, self.processed_memory, 403 | attention_weights_cat, self.mask, attention_weights) 404 | 405 | self.attention_weights_cum += self.attention_weights 406 | decoder_input = torch.cat( 407 | (self.attention_hidden, self.attention_context), -1) 408 | self.decoder_hidden, self.decoder_cell = self.decoder_rnn( 409 | decoder_input, (self.decoder_hidden, self.decoder_cell)) 410 | self.decoder_hidden = F.dropout( 411 | self.decoder_hidden, self.p_decoder_dropout, self.training) 412 | self.decoder_cell = F.dropout( 413 | self.decoder_cell, self.p_decoder_dropout, self.training) 414 | 415 | decoder_hidden_attention_context = torch.cat( 416 | (self.decoder_hidden, self.attention_context), dim=1) 417 | 418 | decoder_output = self.linear_projection( 419 | decoder_hidden_attention_context) 420 | 421 | gate_prediction = self.gate_layer(decoder_hidden_attention_context) 422 | return 
decoder_output, gate_prediction, self.attention_weights 423 | 424 | def forward(self, memory, decoder_inputs, memory_lengths, f0s): 425 | """ Decoder forward pass for training 426 | PARAMS 427 | ------ 428 | memory: Encoder outputs 429 | decoder_inputs: Decoder inputs for teacher forcing. i.e. mel-specs 430 | memory_lengths: Encoder output lengths for attention masking. 431 | 432 | RETURNS 433 | ------- 434 | mel_outputs: mel outputs from the decoder 435 | gate_outputs: gate outputs from the decoder 436 | alignments: sequence of attention weights from the decoder 437 | """ 438 | 439 | decoder_input = self.get_go_frame(memory).unsqueeze(0) 440 | decoder_inputs = self.parse_decoder_inputs(decoder_inputs) 441 | decoder_inputs = torch.cat((decoder_input, decoder_inputs), dim=0) 442 | decoder_inputs = self.prenet(decoder_inputs) 443 | 444 | # audio features 445 | f0_dummy = self.get_end_f0(f0s) 446 | f0s = torch.cat((f0s, f0_dummy), dim=2) 447 | f0s = F.relu(self.prenet_f0(f0s)) 448 | f0s = f0s.permute(2, 0, 1) 449 | 450 | self.initialize_decoder_states( 451 | memory, mask=~get_mask_from_lengths(memory_lengths)) 452 | 453 | mel_outputs, gate_outputs, alignments = [], [], [] 454 | while len(mel_outputs) < decoder_inputs.size(0) - 1: 455 | if len(mel_outputs) == 0 or np.random.uniform(0.0, 1.0) <= self.p_teacher_forcing: 456 | decoder_input = torch.cat((decoder_inputs[len(mel_outputs)], 457 | f0s[len(mel_outputs)]), dim=1) 458 | else: 459 | decoder_input = torch.cat((self.prenet(mel_outputs[-1]), 460 | f0s[len(mel_outputs)]), dim=1) 461 | mel_output, gate_output, attention_weights = self.decode( 462 | decoder_input) 463 | mel_outputs += [mel_output.squeeze(1)] 464 | gate_outputs += [gate_output.squeeze()] 465 | alignments += [attention_weights] 466 | 467 | mel_outputs, gate_outputs, alignments = self.parse_decoder_outputs( 468 | mel_outputs, gate_outputs, alignments) 469 | 470 | return mel_outputs, gate_outputs, alignments 471 | 472 | def inference(self, memory, f0s): 473 | """ Decoder inference 474 | PARAMS 475 | ------ 476 | memory: Encoder outputs 477 | 478 | RETURNS 479 | ------- 480 | mel_outputs: mel outputs from the decoder 481 | gate_outputs: gate outputs from the decoder 482 | alignments: sequence of attention weights from the decoder 483 | """ 484 | decoder_input = self.get_go_frame(memory) 485 | 486 | self.initialize_decoder_states(memory, mask=None) 487 | f0_dummy = self.get_end_f0(f0s) 488 | f0s = torch.cat((f0s, f0_dummy), dim=2) 489 | f0s = F.relu(self.prenet_f0(f0s)) 490 | f0s = f0s.permute(2, 0, 1) 491 | 492 | mel_outputs, gate_outputs, alignments = [], [], [] 493 | while True: 494 | if len(mel_outputs) < len(f0s): 495 | f0 = f0s[len(mel_outputs)] 496 | else: 497 | f0 = f0s[-1] * 0 498 | 499 | decoder_input = torch.cat((self.prenet(decoder_input), f0), dim=1) 500 | mel_output, gate_output, alignment = self.decode(decoder_input) 501 | 502 | mel_outputs += [mel_output.squeeze(1)] 503 | gate_outputs += [gate_output] 504 | alignments += [alignment] 505 | 506 | if torch.sigmoid(gate_output.data) > self.gate_threshold: 507 | break 508 | elif len(mel_outputs) == self.max_decoder_steps: 509 | print("Warning! 
Reached max decoder steps") 510 | break 511 | 512 | decoder_input = mel_output 513 | 514 | mel_outputs, gate_outputs, alignments = self.parse_decoder_outputs( 515 | mel_outputs, gate_outputs, alignments) 516 | 517 | return mel_outputs, gate_outputs, alignments 518 | 519 | def inference_noattention(self, memory, f0s, attention_map): 520 | """ Decoder inference 521 | PARAMS 522 | ------ 523 | memory: Encoder outputs 524 | 525 | RETURNS 526 | ------- 527 | mel_outputs: mel outputs from the decoder 528 | gate_outputs: gate outputs from the decoder 529 | alignments: sequence of attention weights from the decoder 530 | """ 531 | decoder_input = self.get_go_frame(memory) 532 | 533 | self.initialize_decoder_states(memory, mask=None) 534 | f0_dummy = self.get_end_f0(f0s) 535 | f0s = torch.cat((f0s, f0_dummy), dim=2) 536 | f0s = F.relu(self.prenet_f0(f0s)) 537 | f0s = f0s.permute(2, 0, 1) 538 | 539 | mel_outputs, gate_outputs, alignments = [], [], [] 540 | for i in range(len(attention_map)): 541 | f0 = f0s[i] 542 | attention = attention_map[i] 543 | decoder_input = torch.cat((self.prenet(decoder_input), f0), dim=1) 544 | mel_output, gate_output, alignment = self.decode(decoder_input, attention) 545 | 546 | mel_outputs += [mel_output.squeeze(1)] 547 | gate_outputs += [gate_output] 548 | alignments += [alignment] 549 | 550 | decoder_input = mel_output 551 | 552 | mel_outputs, gate_outputs, alignments = self.parse_decoder_outputs( 553 | mel_outputs, gate_outputs, alignments) 554 | 555 | return mel_outputs, gate_outputs, alignments 556 | 557 | 558 | class Tacotron2(nn.Module): 559 | def __init__(self, hparams): 560 | super(Tacotron2, self).__init__() 561 | self.mask_padding = hparams.mask_padding 562 | self.fp16_run = hparams.fp16_run 563 | self.n_mel_channels = hparams.n_mel_channels 564 | self.n_frames_per_step = hparams.n_frames_per_step 565 | self.embedding = nn.Embedding( 566 | hparams.n_symbols, hparams.symbols_embedding_dim) 567 | std = sqrt(2.0 / (hparams.n_symbols + hparams.symbols_embedding_dim)) 568 | val = sqrt(3.0) * std # uniform bounds for std 569 | self.embedding.weight.data.uniform_(-val, val) 570 | self.encoder = Encoder(hparams) 571 | self.decoder = Decoder(hparams) 572 | self.postnet = Postnet(hparams) 573 | if hparams.with_gst: 574 | self.gst = GST(hparams) 575 | self.speaker_embedding = nn.Embedding( 576 | hparams.n_speakers, hparams.speaker_embedding_dim) 577 | 578 | def parse_batch(self, batch): 579 | text_padded, input_lengths, mel_padded, gate_padded, \ 580 | output_lengths, speaker_ids, f0_padded = batch 581 | text_padded = to_gpu(text_padded).long() 582 | input_lengths = to_gpu(input_lengths).long() 583 | max_len = torch.max(input_lengths.data).item() 584 | mel_padded = to_gpu(mel_padded).float() 585 | gate_padded = to_gpu(gate_padded).float() 586 | output_lengths = to_gpu(output_lengths).long() 587 | speaker_ids = to_gpu(speaker_ids.data).long() 588 | f0_padded = to_gpu(f0_padded).float() 589 | return ((text_padded, input_lengths, mel_padded, max_len, 590 | output_lengths, speaker_ids, f0_padded), 591 | (mel_padded, gate_padded)) 592 | 593 | def parse_output(self, outputs, output_lengths=None): 594 | if self.mask_padding and output_lengths is not None: 595 | mask = ~get_mask_from_lengths(output_lengths) 596 | mask = mask.expand(self.n_mel_channels, mask.size(0), mask.size(1)) 597 | mask = mask.permute(1, 0, 2) 598 | 599 | outputs[0].data.masked_fill_(mask, 0.0) 600 | outputs[1].data.masked_fill_(mask, 0.0) 601 | outputs[2].data.masked_fill_(mask[:, 0, :], 1e3) # gate 
energies 602 | 603 | return outputs 604 | 605 | def forward(self, inputs): 606 | inputs, input_lengths, targets, max_len, \ 607 | output_lengths, speaker_ids, f0s = inputs 608 | input_lengths, output_lengths = input_lengths.data, output_lengths.data 609 | 610 | embedded_inputs = self.embedding(inputs).transpose(1, 2) 611 | embedded_text = self.encoder(embedded_inputs, input_lengths) 612 | embedded_speakers = self.speaker_embedding(speaker_ids)[:, None] 613 | embedded_gst = self.gst(targets, output_lengths) 614 | embedded_gst = embedded_gst.repeat(1, embedded_text.size(1), 1) 615 | embedded_speakers = embedded_speakers.repeat(1, embedded_text.size(1), 1) 616 | 617 | encoder_outputs = torch.cat( 618 | (embedded_text, embedded_gst, embedded_speakers), dim=2) 619 | 620 | mel_outputs, gate_outputs, alignments = self.decoder( 621 | encoder_outputs, targets, memory_lengths=input_lengths, f0s=f0s) 622 | 623 | mel_outputs_postnet = self.postnet(mel_outputs) 624 | mel_outputs_postnet = mel_outputs + mel_outputs_postnet 625 | 626 | return self.parse_output( 627 | [mel_outputs, mel_outputs_postnet, gate_outputs, alignments], 628 | output_lengths) 629 | 630 | def inference(self, inputs): 631 | text, style_input, speaker_ids, f0s = inputs 632 | embedded_inputs = self.embedding(text).transpose(1, 2) 633 | embedded_text = self.encoder.inference(embedded_inputs) 634 | embedded_speakers = self.speaker_embedding(speaker_ids)[:, None] 635 | if hasattr(self, 'gst'): 636 | if isinstance(style_input, int): 637 | query = torch.zeros(1, 1, self.gst.encoder.ref_enc_gru_size).cuda() 638 | GST = torch.tanh(self.gst.stl.embed) 639 | key = GST[style_input].unsqueeze(0).expand(1, -1, -1) 640 | embedded_gst = self.gst.stl.attention(query, key) 641 | else: 642 | embedded_gst = self.gst(style_input) 643 | 644 | embedded_speakers = embedded_speakers.repeat(1, embedded_text.size(1), 1) 645 | if hasattr(self, 'gst'): 646 | embedded_gst = embedded_gst.repeat(1, embedded_text.size(1), 1) 647 | encoder_outputs = torch.cat( 648 | (embedded_text, embedded_gst, embedded_speakers), dim=2) 649 | else: 650 | encoder_outputs = torch.cat( 651 | (embedded_text, embedded_speakers), dim=2) 652 | 653 | mel_outputs, gate_outputs, alignments = self.decoder.inference( 654 | encoder_outputs, f0s) 655 | 656 | mel_outputs_postnet = self.postnet(mel_outputs) 657 | mel_outputs_postnet = mel_outputs + mel_outputs_postnet 658 | 659 | return self.parse_output( 660 | [mel_outputs, mel_outputs_postnet, gate_outputs, alignments]) 661 | 662 | def inference_noattention(self, inputs): 663 | text, style_input, speaker_ids, f0s, attention_map = inputs 664 | embedded_inputs = self.embedding(text).transpose(1, 2) 665 | embedded_text = self.encoder.inference(embedded_inputs) 666 | embedded_speakers = self.speaker_embedding(speaker_ids)[:, None] 667 | if hasattr(self, 'gst'): 668 | if isinstance(style_input, int): 669 | query = torch.zeros(1, 1, self.gst.encoder.ref_enc_gru_size).cuda() 670 | GST = torch.tanh(self.gst.stl.embed) 671 | key = GST[style_input].unsqueeze(0).expand(1, -1, -1) 672 | embedded_gst = self.gst.stl.attention(query, key) 673 | else: 674 | embedded_gst = self.gst(style_input) 675 | 676 | embedded_speakers = embedded_speakers.repeat(1, embedded_text.size(1), 1) 677 | if hasattr(self, 'gst'): 678 | embedded_gst = embedded_gst.repeat(1, embedded_text.size(1), 1) 679 | encoder_outputs = torch.cat( 680 | (embedded_text, embedded_gst, embedded_speakers), dim=2) 681 | else: 682 | encoder_outputs = torch.cat( 683 | (embedded_text, embedded_speakers), 
dim=2) 684 | 685 | mel_outputs, gate_outputs, alignments = self.decoder.inference_noattention( 686 | encoder_outputs, f0s, attention_map) 687 | 688 | mel_outputs_postnet = self.postnet(mel_outputs) 689 | mel_outputs_postnet = mel_outputs + mel_outputs_postnet 690 | 691 | return self.parse_output( 692 | [mel_outputs, mel_outputs_postnet, gate_outputs, alignments]) 693 | -------------------------------------------------------------------------------- /modules.py: -------------------------------------------------------------------------------- 1 | # adapted from https://github.com/KinglittleQ/GST-Tacotron/blob/master/GST.py 2 | # MIT License 3 | # 4 | # Copyright (c) 2018 MagicGirl Sakura 5 | # 6 | # Permission is hereby granted, free of charge, to any person obtaining a copy 7 | # of this software and associated documentation files (the "Software"), to deal 8 | # in the Software without restriction, including without limitation the rights 9 | # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 10 | # copies of the Software, and to permit persons to whom the Software is 11 | # furnished to do so, subject to the following conditions: 12 | # 13 | # The above copyright notice and this permission notice shall be included in 14 | # all copies or substantial portions of the Software. 15 | # 16 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 17 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 18 | # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL 19 | # THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 20 | # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 21 | # FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER 22 | # DEALINGS IN THE SOFTWARE. 
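#
# Minimal usage sketch for the GST module defined below; the hyper-parameter values
# here are illustrative assumptions, not necessarily the repo defaults:
#
#     import torch
#     from modules import GST
#
#     class RefHParams:                       # hypothetical stand-in for hparams.py
#         ref_enc_filters = [32, 32, 64, 64, 128, 128]
#         ref_enc_gru_size = 128
#         n_mel_channels = 80
#         token_num = 10
#         token_embedding_size = 256
#         num_heads = 8
#
#     gst = GST(RefHParams())
#     ref_mels = torch.randn(2, 100, 80)      # [N, frames, n_mel_channels] reference mels
#     style = gst(ref_mels)                   # -> [N, 1, token_embedding_size] style embedding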
23 | 24 | 25 | import torch 26 | import torch.nn as nn 27 | import torch.nn.init as init 28 | import torch.nn.functional as F 29 | 30 | 31 | class ReferenceEncoder(nn.Module): 32 | ''' 33 | inputs --- [N, Ty/r, n_mels*r] mels 34 | outputs --- [N, ref_enc_gru_size] 35 | ''' 36 | 37 | def __init__(self, hp): 38 | 39 | super().__init__() 40 | K = len(hp.ref_enc_filters) 41 | filters = [1] + hp.ref_enc_filters 42 | 43 | convs = [nn.Conv2d(in_channels=filters[i], 44 | out_channels=filters[i + 1], 45 | kernel_size=(3, 3), 46 | stride=(2, 2), 47 | padding=(1, 1)) for i in range(K)] 48 | self.convs = nn.ModuleList(convs) 49 | self.bns = nn.ModuleList( 50 | [nn.BatchNorm2d(num_features=hp.ref_enc_filters[i]) 51 | for i in range(K)]) 52 | 53 | out_channels = self.calculate_channels(hp.n_mel_channels, 3, 2, 1, K) 54 | self.gru = nn.GRU(input_size=hp.ref_enc_filters[-1] * out_channels, 55 | hidden_size=hp.ref_enc_gru_size, 56 | batch_first=True) 57 | self.n_mel_channels = hp.n_mel_channels 58 | self.ref_enc_gru_size = hp.ref_enc_gru_size 59 | 60 | def forward(self, inputs, input_lengths=None): 61 | out = inputs.view(inputs.size(0), 1, -1, self.n_mel_channels) 62 | for conv, bn in zip(self.convs, self.bns): 63 | out = conv(out) 64 | out = bn(out) 65 | out = F.relu(out) 66 | 67 | out = out.transpose(1, 2) # [N, Ty//2^K, 128, n_mels//2^K] 68 | N, T = out.size(0), out.size(1) 69 | out = out.contiguous().view(N, T, -1) # [N, Ty//2^K, 128*n_mels//2^K] 70 | 71 | if input_lengths is not None: 72 | input_lengths = torch.ceil(input_lengths.float() / 2 ** len(self.convs)) 73 | input_lengths = input_lengths.cpu().numpy().astype(int) 74 | out = nn.utils.rnn.pack_padded_sequence( 75 | out, input_lengths, batch_first=True, enforce_sorted=False) 76 | 77 | self.gru.flatten_parameters() 78 | _, out = self.gru(out) 79 | return out.squeeze(0) 80 | 81 | def calculate_channels(self, L, kernel_size, stride, pad, n_convs): 82 | for _ in range(n_convs): 83 | L = (L - kernel_size + 2 * pad) // stride + 1 84 | return L 85 | 86 | 87 | class STL(nn.Module): 88 | ''' 89 | inputs --- [N, token_embedding_size//2] 90 | ''' 91 | def __init__(self, hp): 92 | super().__init__() 93 | self.embed = nn.Parameter(torch.FloatTensor(hp.token_num, hp.token_embedding_size // hp.num_heads)) 94 | d_q = hp.ref_enc_gru_size 95 | d_k = hp.token_embedding_size // hp.num_heads 96 | self.attention = MultiHeadAttention( 97 | query_dim=d_q, key_dim=d_k, num_units=hp.token_embedding_size, 98 | num_heads=hp.num_heads) 99 | 100 | init.normal_(self.embed, mean=0, std=0.5) 101 | 102 | def forward(self, inputs): 103 | N = inputs.size(0) 104 | query = inputs.unsqueeze(1) 105 | keys = torch.tanh(self.embed).unsqueeze(0).expand(N, -1, -1) # [N, token_num, token_embedding_size // num_heads] 106 | style_embed = self.attention(query, keys) 107 | 108 | return style_embed 109 | 110 | 111 | class MultiHeadAttention(nn.Module): 112 | ''' 113 | input: 114 | query --- [N, T_q, query_dim] 115 | key --- [N, T_k, key_dim] 116 | output: 117 | out --- [N, T_q, num_units] 118 | ''' 119 | def __init__(self, query_dim, key_dim, num_units, num_heads): 120 | super().__init__() 121 | self.num_units = num_units 122 | self.num_heads = num_heads 123 | self.key_dim = key_dim 124 | 125 | self.W_query = nn.Linear(in_features=query_dim, out_features=num_units, bias=False) 126 | self.W_key = nn.Linear(in_features=key_dim, out_features=num_units, bias=False) 127 | self.W_value = nn.Linear(in_features=key_dim, out_features=num_units, bias=False) 128 | 129 | def forward(self, query, key): 130 | 
querys = self.W_query(query) # [N, T_q, num_units] 131 | keys = self.W_key(key) # [N, T_k, num_units] 132 | values = self.W_value(key) 133 | 134 | split_size = self.num_units // self.num_heads 135 | querys = torch.stack(torch.split(querys, split_size, dim=2), dim=0) # [h, N, T_q, num_units/h] 136 | keys = torch.stack(torch.split(keys, split_size, dim=2), dim=0) # [h, N, T_k, num_units/h] 137 | values = torch.stack(torch.split(values, split_size, dim=2), dim=0) # [h, N, T_k, num_units/h] 138 | 139 | # score = softmax(QK^T / (d_k ** 0.5)) 140 | scores = torch.matmul(querys, keys.transpose(2, 3)) # [h, N, T_q, T_k] 141 | scores = scores / (self.key_dim ** 0.5) 142 | scores = F.softmax(scores, dim=3) 143 | 144 | # out = score * V 145 | out = torch.matmul(scores, values) # [h, N, T_q, num_units/h] 146 | out = torch.cat(torch.split(out, 1, dim=0), dim=3).squeeze(0) # [N, T_q, num_units] 147 | 148 | return out 149 | 150 | 151 | class GST(nn.Module): 152 | def __init__(self, hp): 153 | super().__init__() 154 | self.encoder = ReferenceEncoder(hp) 155 | self.stl = STL(hp) 156 | 157 | def forward(self, inputs, input_lengths=None): 158 | enc_out = self.encoder(inputs, input_lengths=input_lengths) 159 | style_embed = self.stl(enc_out) 160 | 161 | return style_embed 162 | -------------------------------------------------------------------------------- /multiproc.py: -------------------------------------------------------------------------------- 1 | import time 2 | import torch 3 | import sys 4 | import subprocess 5 | 6 | argslist = list(sys.argv)[1:] 7 | num_gpus = torch.cuda.device_count() 8 | argslist.append('--n_gpus={}'.format(num_gpus)) 9 | workers = [] 10 | job_id = time.strftime("%Y_%m_%d-%H%M%S") 11 | argslist.append("--group_name=group_{}".format(job_id)) 12 | 13 | for i in range(num_gpus): 14 | argslist.append('--rank={}'.format(i)) 15 | stdout = None if i == 0 else open("logs/{}_GPU_{}.log".format(job_id, i), 16 | "w") 17 | print(argslist) 18 | p = subprocess.Popen([str(sys.executable)]+argslist, stdout=stdout) 19 | workers.append(p) 20 | argslist = argslist[:-1] 21 | 22 | for p in workers: 23 | p.wait() 24 | -------------------------------------------------------------------------------- /plotting_utils.py: -------------------------------------------------------------------------------- 1 | import matplotlib 2 | matplotlib.use("Agg") 3 | import matplotlib.pylab as plt 4 | import numpy as np 5 | 6 | 7 | def save_figure_to_numpy(fig): 8 | # save it to a numpy array. 
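    # np.fromstring with sep='' is deprecated in recent NumPy releases;
    # np.frombuffer(fig.canvas.tostring_rgb(), dtype=np.uint8) is the usual
    # drop-in replacement if the call below starts raising a DeprecationWarning.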
9 | data = np.fromstring(fig.canvas.tostring_rgb(), dtype=np.uint8, sep='') 10 | data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,)) 11 | return data 12 | 13 | 14 | def plot_alignment_to_numpy(alignment, info=None): 15 | fig, ax = plt.subplots(figsize=(6, 4)) 16 | im = ax.imshow(alignment, aspect='auto', origin='lower', 17 | interpolation='none') 18 | fig.colorbar(im, ax=ax) 19 | xlabel = 'Decoder timestep' 20 | if info is not None: 21 | xlabel += '\n\n' + info 22 | plt.xlabel(xlabel) 23 | plt.ylabel('Encoder timestep') 24 | plt.tight_layout() 25 | 26 | fig.canvas.draw() 27 | data = save_figure_to_numpy(fig) 28 | plt.close() 29 | return data 30 | 31 | 32 | def plot_spectrogram_to_numpy(spectrogram): 33 | fig, ax = plt.subplots(figsize=(12, 3)) 34 | im = ax.imshow(spectrogram, aspect="auto", origin="lower", 35 | interpolation='none') 36 | plt.colorbar(im, ax=ax) 37 | plt.xlabel("Frames") 38 | plt.ylabel("Channels") 39 | plt.tight_layout() 40 | 41 | fig.canvas.draw() 42 | data = save_figure_to_numpy(fig) 43 | plt.close() 44 | return data 45 | 46 | 47 | def plot_gate_outputs_to_numpy(gate_targets, gate_outputs): 48 | fig, ax = plt.subplots(figsize=(12, 3)) 49 | ax.scatter(range(len(gate_targets)), gate_targets, alpha=0.5, 50 | color='green', marker='+', s=1, label='target') 51 | ax.scatter(range(len(gate_outputs)), gate_outputs, alpha=0.5, 52 | color='red', marker='.', s=1, label='predicted') 53 | 54 | plt.xlabel("Frames (Green target, Red predicted)") 55 | plt.ylabel("Gate State") 56 | plt.tight_layout() 57 | 58 | fig.canvas.draw() 59 | data = save_figure_to_numpy(fig) 60 | plt.close() 61 | return data 62 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | matplotlib==2.1.0 2 | tensorflow==1.15.2 3 | inflect==0.2.5 4 | librosa==0.6.0 5 | scipy==1.0.0 6 | tensorboardX==1.1 7 | Unidecode==1.0.22 8 | pillow 9 | nltk==3.4.5 10 | jamo==0.4.1 11 | music21 12 | -------------------------------------------------------------------------------- /stft.py: -------------------------------------------------------------------------------- 1 | """ 2 | BSD 3-Clause License 3 | 4 | Copyright (c) 2017, Prem Seetharaman 5 | All rights reserved. 6 | 7 | * Redistribution and use in source and binary forms, with or without 8 | modification, are permitted provided that the following conditions are met: 9 | 10 | * Redistributions of source code must retain the above copyright notice, 11 | this list of conditions and the following disclaimer. 12 | 13 | * Redistributions in binary form must reproduce the above copyright notice, this 14 | list of conditions and the following disclaimer in the 15 | documentation and/or other materials provided with the distribution. 16 | 17 | * Neither the name of the copyright holder nor the names of its 18 | contributors may be used to endorse or promote products derived from this 19 | software without specific prior written permission. 20 | 21 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 22 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 23 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 24 | DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR 25 | ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 26 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 27 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON 28 | ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 29 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 30 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 31 | """ 32 | 33 | import torch 34 | import numpy as np 35 | import torch.nn.functional as F 36 | from torch.autograd import Variable 37 | from scipy.signal import get_window 38 | from librosa.util import pad_center, tiny 39 | from audio_processing import window_sumsquare 40 | 41 | 42 | class STFT(torch.nn.Module): 43 | """adapted from Prem Seetharaman's https://github.com/pseeth/pytorch-stft""" 44 | def __init__(self, filter_length=800, hop_length=200, win_length=800, 45 | window='hann'): 46 | super(STFT, self).__init__() 47 | self.filter_length = filter_length 48 | self.hop_length = hop_length 49 | self.win_length = win_length 50 | self.window = window 51 | self.forward_transform = None 52 | scale = self.filter_length / self.hop_length 53 | fourier_basis = np.fft.fft(np.eye(self.filter_length)) 54 | 55 | cutoff = int((self.filter_length / 2 + 1)) 56 | fourier_basis = np.vstack([np.real(fourier_basis[:cutoff, :]), 57 | np.imag(fourier_basis[:cutoff, :])]) 58 | 59 | forward_basis = torch.FloatTensor(fourier_basis[:, None, :]) 60 | inverse_basis = torch.FloatTensor( 61 | np.linalg.pinv(scale * fourier_basis).T[:, None, :]) 62 | 63 | if window is not None: 64 | assert(filter_length >= win_length) 65 | # get window and zero center pad it to filter_length 66 | fft_window = get_window(window, win_length, fftbins=True) 67 | fft_window = pad_center(fft_window, filter_length) 68 | fft_window = torch.from_numpy(fft_window).float() 69 | 70 | # window the bases 71 | forward_basis *= fft_window 72 | inverse_basis *= fft_window 73 | 74 | self.register_buffer('forward_basis', forward_basis.float()) 75 | self.register_buffer('inverse_basis', inverse_basis.float()) 76 | 77 | def transform(self, input_data): 78 | num_batches = input_data.size(0) 79 | num_samples = input_data.size(1) 80 | 81 | self.num_samples = num_samples 82 | 83 | # similar to librosa, reflect-pad the input 84 | input_data = input_data.view(num_batches, 1, num_samples) 85 | input_data = F.pad( 86 | input_data.unsqueeze(1), 87 | (int(self.filter_length / 2), int(self.filter_length / 2), 0, 0), 88 | mode='reflect') 89 | input_data = input_data.squeeze(1) 90 | 91 | forward_transform = F.conv1d( 92 | input_data, 93 | Variable(self.forward_basis, requires_grad=False), 94 | stride=self.hop_length, 95 | padding=0) 96 | 97 | cutoff = int((self.filter_length / 2) + 1) 98 | real_part = forward_transform[:, :cutoff, :] 99 | imag_part = forward_transform[:, cutoff:, :] 100 | 101 | magnitude = torch.sqrt(real_part**2 + imag_part**2) 102 | phase = torch.autograd.Variable( 103 | torch.atan2(imag_part.data, real_part.data)) 104 | 105 | return magnitude, phase 106 | 107 | def inverse(self, magnitude, phase): 108 | recombine_magnitude_phase = torch.cat( 109 | [magnitude*torch.cos(phase), magnitude*torch.sin(phase)], dim=1) 110 | 111 | inverse_transform = F.conv_transpose1d( 112 | recombine_magnitude_phase, 113 | Variable(self.inverse_basis, requires_grad=False), 114 | stride=self.hop_length, 115 | padding=0) 116 | 117 | 
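        # window_sumsquare() returns the overlap-added square of the analysis window;
        # dividing by it where it is non-negligible removes the amplitude modulation
        # introduced by windowing, and the filter_length / hop_length factor that
        # follows restores the overall scale of the reconstruction.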
if self.window is not None: 118 | window_sum = window_sumsquare( 119 | self.window, magnitude.size(-1), hop_length=self.hop_length, 120 | win_length=self.win_length, n_fft=self.filter_length, 121 | dtype=np.float32) 122 | # remove modulation effects 123 | approx_nonzero_indices = torch.from_numpy( 124 | np.where(window_sum > tiny(window_sum))[0]) 125 | window_sum = torch.autograd.Variable( 126 | torch.from_numpy(window_sum), requires_grad=False) 127 | window_sum = window_sum.cuda() if magnitude.is_cuda else window_sum 128 | inverse_transform[:, :, approx_nonzero_indices] /= window_sum[approx_nonzero_indices] 129 | 130 | # scale by hop ratio 131 | inverse_transform *= float(self.filter_length) / self.hop_length 132 | 133 | inverse_transform = inverse_transform[:, :, int(self.filter_length/2):] 134 | inverse_transform = inverse_transform[:, :, :-int(self.filter_length/2):] 135 | 136 | return inverse_transform 137 | 138 | def forward(self, input_data): 139 | self.magnitude, self.phase = self.transform(input_data) 140 | reconstruction = self.inverse(self.magnitude, self.phase) 141 | return reconstruction 142 | -------------------------------------------------------------------------------- /text/LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2017 Keith Ito 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy 4 | of this software and associated documentation files (the "Software"), to deal 5 | in the Software without restriction, including without limitation the rights 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | copies of the Software, and to permit persons to whom the Software is 8 | furnished to do so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in 11 | all copies or substantial portions of the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 19 | THE SOFTWARE. 
20 | -------------------------------------------------------------------------------- /text/__init__.py: -------------------------------------------------------------------------------- 1 | """ from https://github.com/keithito/tacotron """ 2 | import re 3 | import random 4 | from text import cleaners 5 | from text.symbols import symbols 6 | 7 | 8 | # Mappings from symbol to numeric ID and vice versa: 9 | _symbol_to_id = {s: i for i, s in enumerate(symbols)} 10 | _id_to_symbol = {i: s for i, s in enumerate(symbols)} 11 | 12 | # Regular expression matching text enclosed in curly braces: 13 | _curly_re = re.compile(r'(.*?)\{(.+?)\}(.*)') 14 | _words_re = re.compile(r"([a-zA-ZÀ-ž]+['][a-zA-ZÀ-ž]{1,2}|[a-zA-ZÀ-ž]+)|([{][^}]+[}]|[^a-zA-ZÀ-ž{}]+)") 15 | 16 | 17 | def get_arpabet(word, dictionary): 18 | word_arpabet = dictionary.lookup(word) 19 | if word_arpabet is not None: 20 | return "{" + word_arpabet[0] + "}" 21 | else: 22 | return word 23 | 24 | 25 | def text_to_sequence(text, cleaner_names, dictionary=None, p_arpabet=1.0): 26 | '''Converts a string of text to a sequence of IDs corresponding to the symbols in the text. 27 | 28 | The text can optionally have ARPAbet sequences enclosed in curly braces embedded 29 | in it. For example, "Turn left on {HH AW1 S S T AH0 N} Street." 30 | 31 | Args: 32 | text: string to convert to a sequence 33 | cleaner_names: names of the cleaner functions to run the text through 34 | dictionary: arpabet class with arpabet dictionary 35 | 36 | Returns: 37 | List of integers corresponding to the symbols in the text 38 | ''' 39 | sequence = [] 40 | 41 | # Check for curly braces and treat their contents as ARPAbet: 42 | while len(text): 43 | m = _curly_re.match(text) 44 | if not m: 45 | clean_text = _clean_text(text, cleaner_names) 46 | if dictionary is not None: 47 | words = _words_re.findall(text) 48 | clean_text = [ 49 | get_arpabet(word[0], dictionary) 50 | if ((word[0] != '') and random.random() < p_arpabet) else word[1] 51 | for word in words] 52 | 53 | for i in range(len(clean_text)): 54 | t = clean_text[i] 55 | if t.startswith("{"): 56 | sequence += _arpabet_to_sequence(t[1:-1]) 57 | else: 58 | sequence += _symbols_to_sequence(t) 59 | #sequence += space 60 | else: 61 | sequence += _symbols_to_sequence(clean_text) 62 | break 63 | 64 | sequence += text_to_sequence(m.group(1), cleaner_names, dictionary, p_arpabet) 65 | sequence += _arpabet_to_sequence(m.group(2)) 66 | text = m.group(3) 67 | 68 | return sequence 69 | 70 | 71 | def sequence_to_text(sequence): 72 | '''Converts a sequence of IDs back to a string''' 73 | result = '' 74 | for symbol_id in sequence: 75 | if symbol_id in _id_to_symbol: 76 | s = _id_to_symbol[symbol_id] 77 | # Enclose ARPAbet back in curly braces: 78 | if len(s) > 1 and s[0] == '@': 79 | s = '{%s}' % s[1:] 80 | result += s 81 | return result.replace('}{', ' ') 82 | 83 | 84 | def _clean_text(text, cleaner_names): 85 | for name in cleaner_names: 86 | cleaner = getattr(cleaners, name) 87 | if not cleaner: 88 | raise Exception('Unknown cleaner: %s' % name) 89 | text = cleaner(text) 90 | return text 91 | 92 | 93 | def _symbols_to_sequence(symbols): 94 | return [_symbol_to_id[s] for s in symbols if _should_keep_symbol(s)] 95 | 96 | 97 | def _arpabet_to_sequence(text): 98 | return _symbols_to_sequence(['@' + s for s in text.split()]) 99 | 100 | 101 | def _should_keep_symbol(s): 102 | return s in _symbol_to_id and s is not '_' and s is not '~' 103 | 104 | -------------------------------------------------------------------------------- 
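A minimal usage sketch for text_to_sequence above, assuming the CMU dictionary shipped
under data/cmu_dictionary and the english_cleaners pipeline from text/cleaners.py; the
exact call is illustrative:

    from text import text_to_sequence, sequence_to_text
    from text import cmudict

    arpabet_dict = cmudict.CMUDict('data/cmu_dictionary')
    sequence = text_to_sequence("Turn left on {HH AW1 S S T AH0 N} Street.",
                                ['english_cleaners'], dictionary=arpabet_dict,
                                p_arpabet=1.0)
    print(sequence_to_text(sequence))  # ARPAbet chunks are rendered back in curly braces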
/text/cleaners.py: -------------------------------------------------------------------------------- 1 | """ from https://github.com/keithito/tacotron """ 2 | 3 | ''' 4 | Cleaners are transformations that run over the input text at both training and eval time. 5 | 6 | Cleaners can be selected by passing a comma-delimited list of cleaner names as the "cleaners" 7 | hyperparameter. Some cleaners are English-specific. You'll typically want to use: 8 | 1. "english_cleaners" for English text 9 | 2. "transliteration_cleaners" for non-English text that can be transliterated to ASCII using 10 | the Unidecode library (https://pypi.python.org/pypi/Unidecode) 11 | 3. "basic_cleaners" if you do not want to transliterate (in this case, you should also update 12 | the symbols in symbols.py to match your data). 13 | ''' 14 | 15 | import re 16 | from unidecode import unidecode 17 | from .numbers import normalize_numbers 18 | 19 | 20 | # Regular expression matching whitespace: 21 | _whitespace_re = re.compile(r'\s+') 22 | 23 | # List of (regular expression, replacement) pairs for abbreviations: 24 | _abbreviations = [(re.compile('\\b%s\\.' % x[0], re.IGNORECASE), x[1]) for x in [ 25 | ('mrs', 'misess'), 26 | ('mr', 'mister'), 27 | ('dr', 'doctor'), 28 | ('st', 'saint'), 29 | ('co', 'company'), 30 | ('jr', 'junior'), 31 | ('maj', 'major'), 32 | ('gen', 'general'), 33 | ('drs', 'doctors'), 34 | ('rev', 'reverend'), 35 | ('lt', 'lieutenant'), 36 | ('hon', 'honorable'), 37 | ('sgt', 'sergeant'), 38 | ('capt', 'captain'), 39 | ('esq', 'esquire'), 40 | ('ltd', 'limited'), 41 | ('col', 'colonel'), 42 | ('ft', 'fort'), 43 | ]] 44 | 45 | 46 | def expand_abbreviations(text): 47 | for regex, replacement in _abbreviations: 48 | text = re.sub(regex, replacement, text) 49 | return text 50 | 51 | 52 | def expand_numbers(text): 53 | return normalize_numbers(text) 54 | 55 | 56 | def lowercase(text): 57 | return text.lower() 58 | 59 | 60 | def collapse_whitespace(text): 61 | return re.sub(_whitespace_re, ' ', text) 62 | 63 | 64 | def convert_to_ascii(text): 65 | return unidecode(text) 66 | 67 | 68 | def basic_cleaners(text): 69 | '''Basic pipeline that lowercases and collapses whitespace without transliteration.''' 70 | text = lowercase(text) 71 | text = collapse_whitespace(text) 72 | return text 73 | 74 | 75 | def transliteration_cleaners(text): 76 | '''Pipeline for non-English text that transliterates to ASCII.''' 77 | text = convert_to_ascii(text) 78 | text = lowercase(text) 79 | text = collapse_whitespace(text) 80 | return text 81 | 82 | 83 | def english_cleaners(text): 84 | '''Pipeline for English text, including number and abbreviation expansion.''' 85 | text = convert_to_ascii(text) 86 | text = lowercase(text) 87 | text = expand_numbers(text) 88 | text = expand_abbreviations(text) 89 | text = collapse_whitespace(text) 90 | return text 91 | -------------------------------------------------------------------------------- /text/cmudict.py: -------------------------------------------------------------------------------- 1 | """ from https://github.com/keithito/tacotron """ 2 | 3 | import re 4 | 5 | 6 | valid_symbols = [ 7 | 'AA', 'AA0', 'AA1', 'AA2', 'AE', 'AE0', 'AE1', 'AE2', 'AH', 'AH0', 'AH1', 'AH2', 8 | 'AO', 'AO0', 'AO1', 'AO2', 'AW', 'AW0', 'AW1', 'AW2', 'AY', 'AY0', 'AY1', 'AY2', 9 | 'B', 'CH', 'D', 'DH', 'EH', 'EH0', 'EH1', 'EH2', 'ER', 'ER0', 'ER1', 'ER2', 'EY', 10 | 'EY0', 'EY1', 'EY2', 'F', 'G', 'HH', 'IH', 'IH0', 'IH1', 'IH2', 'IY', 'IY0', 'IY1', 11 | 'IY2', 'JH', 'K', 'L', 'M', 'N', 'NG', 'OW', 'OW0', 'OW1', 
'OW2', 'OY', 'OY0', 12 | 'OY1', 'OY2', 'P', 'R', 'S', 'SH', 'T', 'TH', 'UH', 'UH0', 'UH1', 'UH2', 'UW', 13 | 'UW0', 'UW1', 'UW2', 'V', 'W', 'Y', 'Z', 'ZH' 14 | ] 15 | 16 | _valid_symbol_set = set(valid_symbols) 17 | 18 | 19 | class CMUDict: 20 | '''Thin wrapper around CMUDict data. http://www.speech.cs.cmu.edu/cgi-bin/cmudict''' 21 | def __init__(self, file_or_path, keep_ambiguous=True): 22 | if isinstance(file_or_path, str): 23 | with open(file_or_path, encoding='latin-1') as f: 24 | entries = _parse_cmudict(f) 25 | else: 26 | entries = _parse_cmudict(file_or_path) 27 | if not keep_ambiguous: 28 | entries = {word: pron for word, pron in entries.items() if len(pron) == 1} 29 | self._entries = entries 30 | 31 | 32 | def __len__(self): 33 | return len(self._entries) 34 | 35 | 36 | def lookup(self, word): 37 | '''Returns list of ARPAbet pronunciations of the given word.''' 38 | return self._entries.get(word.upper()) 39 | 40 | 41 | 42 | _alt_re = re.compile(r'\([0-9]+\)') 43 | 44 | 45 | def _parse_cmudict(file): 46 | cmudict = {} 47 | for line in file: 48 | if len(line) and (line[0] >= 'A' and line[0] <= 'Z' or line[0] == "'"): 49 | parts = line.split(' ') 50 | word = re.sub(_alt_re, '', parts[0]) 51 | pronunciation = _get_pronunciation(parts[1]) 52 | if pronunciation: 53 | if word in cmudict: 54 | cmudict[word].append(pronunciation) 55 | else: 56 | cmudict[word] = [pronunciation] 57 | return cmudict 58 | 59 | 60 | def _get_pronunciation(s): 61 | parts = s.strip().split(' ') 62 | for part in parts: 63 | if part not in _valid_symbol_set: 64 | return None 65 | return ' '.join(parts) 66 | -------------------------------------------------------------------------------- /text/numbers.py: -------------------------------------------------------------------------------- 1 | """ from https://github.com/keithito/tacotron """ 2 | 3 | import inflect 4 | import re 5 | 6 | 7 | _inflect = inflect.engine() 8 | _comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])') 9 | _decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)') 10 | _pounds_re = re.compile(r'£([0-9\,]*[0-9]+)') 11 | _dollars_re = re.compile(r'\$([0-9\.\,]*[0-9]+)') 12 | _ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)') 13 | _number_re = re.compile(r'[0-9]+') 14 | 15 | 16 | def _remove_commas(m): 17 | return m.group(1).replace(',', '') 18 | 19 | 20 | def _expand_decimal_point(m): 21 | return m.group(1).replace('.', ' point ') 22 | 23 | 24 | def _expand_dollars(m): 25 | match = m.group(1) 26 | parts = match.split('.') 27 | if len(parts) > 2: 28 | return match + ' dollars' # Unexpected format 29 | dollars = int(parts[0]) if parts[0] else 0 30 | cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0 31 | if dollars and cents: 32 | dollar_unit = 'dollar' if dollars == 1 else 'dollars' 33 | cent_unit = 'cent' if cents == 1 else 'cents' 34 | return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit) 35 | elif dollars: 36 | dollar_unit = 'dollar' if dollars == 1 else 'dollars' 37 | return '%s %s' % (dollars, dollar_unit) 38 | elif cents: 39 | cent_unit = 'cent' if cents == 1 else 'cents' 40 | return '%s %s' % (cents, cent_unit) 41 | else: 42 | return 'zero dollars' 43 | 44 | 45 | def _expand_ordinal(m): 46 | return _inflect.number_to_words(m.group(0)) 47 | 48 | 49 | def _expand_number(m): 50 | num = int(m.group(0)) 51 | if num > 1000 and num < 3000: 52 | if num == 2000: 53 | return 'two thousand' 54 | elif num > 2000 and num < 2010: 55 | return 'two thousand ' + _inflect.number_to_words(num % 100) 56 | elif num % 100 == 0: 57 | return 
_inflect.number_to_words(num // 100) + ' hundred' 58 | else: 59 | return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ') 60 | else: 61 | return _inflect.number_to_words(num, andword='') 62 | 63 | 64 | def normalize_numbers(text): 65 | text = re.sub(_comma_number_re, _remove_commas, text) 66 | text = re.sub(_pounds_re, r'\1 pounds', text) 67 | text = re.sub(_dollars_re, _expand_dollars, text) 68 | text = re.sub(_decimal_number_re, _expand_decimal_point, text) 69 | text = re.sub(_ordinal_re, _expand_ordinal, text) 70 | text = re.sub(_number_re, _expand_number, text) 71 | return text 72 | -------------------------------------------------------------------------------- /text/symbols.py: -------------------------------------------------------------------------------- 1 | """ from https://github.com/keithito/tacotron """ 2 | 3 | ''' 4 | Defines the set of symbols used in text input to the model. 5 | 6 | The default is a set of ASCII characters that works well for English or text that has been run through Unidecode. For other data, you can modify _characters. See TRAINING_DATA.md for details. ''' 7 | from text import cmudict 8 | 9 | _punctuation = '!\'",.:;? ' 10 | _math = '#%&*+-/[]()' 11 | _special = '_@©°½—₩€$' 12 | _accented = 'áçéêëñöøćž' 13 | _numbers = '0123456789' 14 | _letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' 15 | 16 | # Prepend "@" to ARPAbet symbols to ensure uniqueness (some are the same as 17 | # uppercase letters): 18 | _arpabet = ['@' + s for s in cmudict.valid_symbols] 19 | 20 | # Export all symbols: 21 | symbols = list(_punctuation + _math + _special + _accented + _numbers + _letters) + _arpabet 22 | -------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | import os 2 | import time 3 | import argparse 4 | import math 5 | from numpy import finfo 6 | 7 | import torch 8 | from distributed import apply_gradient_allreduce 9 | import torch.distributed as dist 10 | from torch.utils.data.distributed import DistributedSampler 11 | from torch.utils.data import DataLoader 12 | 13 | from model import load_model 14 | from data_utils import TextMelLoader, TextMelCollate 15 | from loss_function import Tacotron2Loss 16 | from logger import Tacotron2Logger 17 | from hparams import create_hparams 18 | 19 | 20 | def reduce_tensor(tensor, n_gpus): 21 | rt = tensor.clone() 22 | dist.all_reduce(rt, op=dist.ReduceOp.SUM) 23 | rt /= n_gpus 24 | return rt 25 | 26 | 27 | def init_distributed(hparams, n_gpus, rank, group_name): 28 | assert torch.cuda.is_available(), "Distributed mode requires CUDA." 29 | print("Initializing Distributed") 30 | 31 | # Set cuda device so everything is done on the right GPU. 
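    # rank % torch.cuda.device_count() maps each process rank to a local GPU index, so on
    # a single node the process launched by multiproc.py with --rank=i runs on GPU i.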
32 | torch.cuda.set_device(rank % torch.cuda.device_count()) 33 | 34 | # Initialize distributed communication 35 | dist.init_process_group( 36 | backend=hparams.dist_backend, init_method=hparams.dist_url, 37 | world_size=n_gpus, rank=rank, group_name=group_name) 38 | 39 | print("Done initializing distributed") 40 | 41 | 42 | def prepare_dataloaders(hparams): 43 | # Get data, data loaders and collate function ready 44 | trainset = TextMelLoader(hparams.training_files, hparams) 45 | valset = TextMelLoader(hparams.validation_files, hparams, 46 | speaker_ids=trainset.speaker_ids) 47 | collate_fn = TextMelCollate(hparams.n_frames_per_step) 48 | 49 | if hparams.distributed_run: 50 | train_sampler = DistributedSampler(trainset) 51 | shuffle = False 52 | else: 53 | train_sampler = None 54 | shuffle = True 55 | 56 | train_loader = DataLoader(trainset, num_workers=1, shuffle=shuffle, 57 | sampler=train_sampler, 58 | batch_size=hparams.batch_size, pin_memory=False, 59 | drop_last=True, collate_fn=collate_fn) 60 | return train_loader, valset, collate_fn, train_sampler 61 | 62 | 63 | def prepare_directories_and_logger(output_directory, log_directory, rank): 64 | if rank == 0: 65 | if not os.path.isdir(output_directory): 66 | os.makedirs(output_directory) 67 | os.chmod(output_directory, 0o775) 68 | logger = Tacotron2Logger(os.path.join(output_directory, log_directory)) 69 | else: 70 | logger = None 71 | return logger 72 | 73 | 74 | def warm_start_model(checkpoint_path, model, ignore_layers): 75 | assert os.path.isfile(checkpoint_path) 76 | print("Warm starting model from checkpoint '{}'".format(checkpoint_path)) 77 | checkpoint_dict = torch.load(checkpoint_path, map_location='cpu') 78 | model_dict = checkpoint_dict['state_dict'] 79 | if len(ignore_layers) > 0: 80 | model_dict = {k: v for k, v in model_dict.items() 81 | if k not in ignore_layers} 82 | dummy_dict = model.state_dict() 83 | dummy_dict.update(model_dict) 84 | model_dict = dummy_dict 85 | model.load_state_dict(model_dict) 86 | return model 87 | 88 | 89 | def load_checkpoint(checkpoint_path, model, optimizer): 90 | assert os.path.isfile(checkpoint_path) 91 | print("Loading checkpoint '{}'".format(checkpoint_path)) 92 | checkpoint_dict = torch.load(checkpoint_path, map_location='cpu') 93 | model.load_state_dict(checkpoint_dict['state_dict']) 94 | optimizer.load_state_dict(checkpoint_dict['optimizer']) 95 | learning_rate = checkpoint_dict['learning_rate'] 96 | iteration = checkpoint_dict['iteration'] 97 | print("Loaded checkpoint '{}' from iteration {}" .format( 98 | checkpoint_path, iteration)) 99 | return model, optimizer, learning_rate, iteration 100 | 101 | 102 | def save_checkpoint(model, optimizer, learning_rate, iteration, filepath): 103 | print("Saving model and optimizer state at iteration {} to {}".format( 104 | iteration, filepath)) 105 | torch.save({'iteration': iteration, 106 | 'state_dict': model.state_dict(), 107 | 'optimizer': optimizer.state_dict(), 108 | 'learning_rate': learning_rate}, filepath) 109 | 110 | 111 | def validate(model, criterion, valset, iteration, batch_size, n_gpus, 112 | collate_fn, logger, distributed_run, rank): 113 | """Handles all the validation scoring and printing""" 114 | model.eval() 115 | with torch.no_grad(): 116 | val_sampler = DistributedSampler(valset) if distributed_run else None 117 | val_loader = DataLoader(valset, sampler=val_sampler, num_workers=1, 118 | shuffle=False, batch_size=batch_size, 119 | pin_memory=False, collate_fn=collate_fn) 120 | 121 | val_loss = 0.0 122 | for i, batch in 
enumerate(val_loader): 123 | x, y = model.parse_batch(batch) 124 | y_pred = model(x) 125 | loss = criterion(y_pred, y) 126 | if distributed_run: 127 | reduced_val_loss = reduce_tensor(loss.data, n_gpus).item() 128 | else: 129 | reduced_val_loss = loss.item() 130 | val_loss += reduced_val_loss 131 | val_loss = val_loss / (i + 1) 132 | 133 | model.train() 134 | if rank == 0: 135 | print("Validation loss {}: {:9f} ".format(iteration, reduced_val_loss)) 136 | logger.log_validation(val_loss, model, y, y_pred, iteration) 137 | 138 | 139 | def train(output_directory, log_directory, checkpoint_path, warm_start, n_gpus, 140 | rank, group_name, hparams): 141 | """Training and validation logging results to tensorboard and stdout 142 | 143 | Params 144 | ------ 145 | output_directory (string): directory to save checkpoints 146 | log_directory (string) directory to save tensorboard logs 147 | checkpoint_path(string): checkpoint path 148 | n_gpus (int): number of gpus 149 | rank (int): rank of current gpu 150 | hparams (object): comma separated list of "name=value" pairs. 151 | """ 152 | if hparams.distributed_run: 153 | init_distributed(hparams, n_gpus, rank, group_name) 154 | 155 | torch.manual_seed(hparams.seed) 156 | torch.cuda.manual_seed(hparams.seed) 157 | 158 | model = load_model(hparams) 159 | learning_rate = hparams.learning_rate 160 | optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, 161 | weight_decay=hparams.weight_decay) 162 | 163 | if hparams.fp16_run: 164 | from apex import amp 165 | model, optimizer = amp.initialize( 166 | model, optimizer, opt_level='O2') 167 | 168 | if hparams.distributed_run: 169 | model = apply_gradient_allreduce(model) 170 | 171 | criterion = Tacotron2Loss() 172 | 173 | logger = prepare_directories_and_logger( 174 | output_directory, log_directory, rank) 175 | 176 | train_loader, valset, collate_fn, train_sampler = prepare_dataloaders(hparams) 177 | 178 | # Load checkpoint if one exists 179 | iteration = 0 180 | epoch_offset = 0 181 | if checkpoint_path is not None: 182 | if warm_start: 183 | model = warm_start_model( 184 | checkpoint_path, model, hparams.ignore_layers) 185 | else: 186 | model, optimizer, _learning_rate, iteration = load_checkpoint( 187 | checkpoint_path, model, optimizer) 188 | if hparams.use_saved_learning_rate: 189 | learning_rate = _learning_rate 190 | iteration += 1 # next iteration is iteration + 1 191 | epoch_offset = max(0, int(iteration / len(train_loader))) 192 | 193 | model.train() 194 | is_overflow = False 195 | # ================ MAIN TRAINNIG LOOP! 
=================== 196 | for epoch in range(epoch_offset, hparams.epochs): 197 | print("Epoch: {}".format(epoch)) 198 | if train_sampler is not None: 199 | train_sampler.set_epoch(epoch) 200 | for i, batch in enumerate(train_loader): 201 | start = time.perf_counter() 202 | if iteration > 0 and iteration % hparams.learning_rate_anneal == 0: 203 | learning_rate = max( 204 | hparams.learning_rate_min, learning_rate * 0.5) 205 | for param_group in optimizer.param_groups: 206 | param_group['lr'] = learning_rate 207 | 208 | model.zero_grad() 209 | x, y = model.parse_batch(batch) 210 | y_pred = model(x) 211 | 212 | loss = criterion(y_pred, y) 213 | if hparams.distributed_run: 214 | reduced_loss = reduce_tensor(loss.data, n_gpus).item() 215 | else: 216 | reduced_loss = loss.item() 217 | 218 | if hparams.fp16_run: 219 | with amp.scale_loss(loss, optimizer) as scaled_loss: 220 | scaled_loss.backward() 221 | else: 222 | loss.backward() 223 | 224 | if hparams.fp16_run: 225 | grad_norm = torch.nn.utils.clip_grad_norm_( 226 | amp.master_params(optimizer), hparams.grad_clip_thresh) 227 | is_overflow = math.isnan(grad_norm) 228 | else: 229 | grad_norm = torch.nn.utils.clip_grad_norm_( 230 | model.parameters(), hparams.grad_clip_thresh) 231 | 232 | optimizer.step() 233 | 234 | if not is_overflow and rank == 0: 235 | duration = time.perf_counter() - start 236 | print("Train loss {} {:.6f} Grad Norm {:.6f} {:.2f}s/it".format( 237 | iteration, reduced_loss, grad_norm, duration)) 238 | logger.log_training( 239 | reduced_loss, grad_norm, learning_rate, duration, iteration) 240 | 241 | if not is_overflow and (iteration % hparams.iters_per_checkpoint == 0): 242 | validate(model, criterion, valset, iteration, 243 | hparams.batch_size, n_gpus, collate_fn, logger, 244 | hparams.distributed_run, rank) 245 | if rank == 0: 246 | checkpoint_path = os.path.join( 247 | output_directory, "checkpoint_{}".format(iteration)) 248 | save_checkpoint(model, optimizer, learning_rate, iteration, 249 | checkpoint_path) 250 | 251 | iteration += 1 252 | 253 | 254 | if __name__ == '__main__': 255 | parser = argparse.ArgumentParser() 256 | parser.add_argument('-o', '--output_directory', type=str, 257 | help='directory to save checkpoints') 258 | parser.add_argument('-l', '--log_directory', type=str, 259 | help='directory to save tensorboard logs') 260 | parser.add_argument('-c', '--checkpoint_path', type=str, default=None, 261 | required=False, help='checkpoint path') 262 | parser.add_argument('--warm_start', action='store_true', 263 | help='load model weights only, ignore specified layers') 264 | parser.add_argument('--n_gpus', type=int, default=1, 265 | required=False, help='number of gpus') 266 | parser.add_argument('--rank', type=int, default=0, 267 | required=False, help='rank of current gpu') 268 | parser.add_argument('--group_name', type=str, default='group_name', 269 | required=False, help='Distributed group name') 270 | parser.add_argument('--hparams', type=str, 271 | required=False, help='comma separated name=value pairs') 272 | 273 | args = parser.parse_args() 274 | hparams = create_hparams(args.hparams) 275 | 276 | torch.backends.cudnn.enabled = hparams.cudnn_enabled 277 | torch.backends.cudnn.benchmark = hparams.cudnn_benchmark 278 | 279 | print("FP16 Run:", hparams.fp16_run) 280 | print("Dynamic Loss Scaling:", hparams.dynamic_loss_scaling) 281 | print("Distributed Run:", hparams.distributed_run) 282 | print("cuDNN Enabled:", hparams.cudnn_enabled) 283 | print("cuDNN Benchmark:", hparams.cudnn_benchmark) 284 | 285 | 
train(args.output_directory, args.log_directory, args.checkpoint_path, 286 | args.warm_start, args.n_gpus, args.rank, args.group_name, hparams) 287 | -------------------------------------------------------------------------------- /utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from scipy.io.wavfile import read 3 | import torch 4 | 5 | 6 | def get_mask_from_lengths(lengths): 7 | max_len = torch.max(lengths).item() 8 | ids = torch.arange(0, max_len, out=torch.cuda.LongTensor(max_len)) 9 | mask = (ids < lengths.unsqueeze(1)).bool() 10 | return mask 11 | 12 | 13 | def load_wav_to_torch(full_path): 14 | sampling_rate, data = read(full_path) 15 | return torch.FloatTensor(data.astype(np.float32)), sampling_rate 16 | 17 | 18 | def load_filepaths_and_text(filename, split="|"): 19 | with open(filename, encoding='utf-8') as f: 20 | filepaths_and_text = [line.strip().split(split) for line in f] 21 | return filepaths_and_text 22 | 23 | 24 | def files_to_list(filename): 25 | """ 26 | Takes a text file of filenames and makes a list of filenames 27 | """ 28 | with open(filename, encoding='utf-8') as f: 29 | files = f.readlines() 30 | 31 | files = [f.rstrip() for f in files] 32 | return files 33 | 34 | 35 | def to_gpu(x): 36 | x = x.contiguous() 37 | 38 | if torch.cuda.is_available(): 39 | x = x.cuda(non_blocking=True) 40 | return torch.autograd.Variable(x) 41 | -------------------------------------------------------------------------------- /yin.py: -------------------------------------------------------------------------------- 1 | # adapted from https://github.com/patriceguyot/Yin 2 | 3 | import numpy as np 4 | 5 | 6 | def differenceFunction(x, N, tau_max): 7 | """ 8 | Compute difference function of data x. This corresponds to equation (6) in [1] 9 | This solution is implemented directly with Numpy fft. 10 | 11 | 12 | :param x: audio data 13 | :param N: length of data 14 | :param tau_max: integration window size 15 | :return: difference function 16 | :rtype: list 17 | """ 18 | 19 | x = np.array(x, np.float64) 20 | w = x.size 21 | tau_max = min(tau_max, w) 22 | x_cumsum = np.concatenate((np.array([0.]), (x * x).cumsum())) 23 | size = w + tau_max 24 | p2 = (size // 32).bit_length() 25 | nice_numbers = (16, 18, 20, 24, 25, 27, 30, 32) 26 | size_pad = min(x * 2 ** p2 for x in nice_numbers if x * 2 ** p2 >= size) 27 | fc = np.fft.rfft(x, size_pad) 28 | conv = np.fft.irfft(fc * fc.conjugate())[:tau_max] 29 | return x_cumsum[w:w - tau_max:-1] + x_cumsum[w] - x_cumsum[:tau_max] - 2 * conv 30 | 31 | 32 | def cumulativeMeanNormalizedDifferenceFunction(df, N): 33 | """ 34 | Compute cumulative mean normalized difference function (CMND). 35 | 36 | This corresponds to equation (8) in [1] 37 | 38 | :param df: Difference function 39 | :param N: length of data 40 | :return: cumulative mean normalized difference function 41 | :rtype: list 42 | """ 43 | 44 | cmndf = df[1:] * range(1, N) / np.cumsum(df[1:]).astype(float) #scipy method 45 | return np.insert(cmndf, 0, 1) 46 | 47 | 48 | def getPitch(cmdf, tau_min, tau_max, harmo_th=0.1): 49 | """ 50 | Return fundamental period of a frame based on CMND function. 
51 | 
52 |     :param cmdf: Cumulative Mean Normalized Difference function
53 |     :param tau_min: minimum period for speech
54 |     :param tau_max: maximum period for speech
55 |     :param harmo_th: harmonicity threshold to determine if it is necessary to compute pitch frequency
56 |     :return: fundamental period if there are values under threshold, 0 otherwise
57 |     :rtype: float
58 |     """
59 |     tau = tau_min
60 |     while tau < tau_max:
61 |         if cmdf[tau] < harmo_th:
62 |             while tau + 1 < tau_max and cmdf[tau + 1] < cmdf[tau]:
63 |                 tau += 1
64 |             return tau
65 |         tau += 1
66 | 
67 |     return 0    # if unvoiced
68 | 
69 | 
70 | def compute_yin(sig, sr, w_len=512, w_step=256, f0_min=100, f0_max=500,
71 |                 harmo_thresh=0.1):
72 |     """
73 | 
74 |     Compute the Yin Algorithm. Return fundamental frequency and harmonic rate.
75 | 
76 |     :param sig: Audio signal (list of float)
77 |     :param sr: sampling rate (int)
78 |     :param w_len: size of the analysis window (samples)
79 |     :param w_step: size of the lag between two consecutive windows (samples)
80 |     :param f0_min: Minimum fundamental frequency that can be detected (hertz)
81 |     :param f0_max: Maximum fundamental frequency that can be detected (hertz)
82 |     :param harmo_thresh: Threshold of detection. The algorithm returns the first minimum of the CMND function below this threshold.
83 | 
84 |     :returns:
85 | 
86 |     * pitches: list of fundamental frequencies,
87 |     * harmonic_rates: list of harmonic rate values for each fundamental frequency value (= confidence value)
88 |     * argmins: minimums of the Cumulative Mean Normalized Difference Function
89 |     * times: list of the time of each estimation
90 |     :rtype: tuple
91 |     """
92 | 
93 |     tau_min = int(sr / f0_max)
94 |     tau_max = int(sr / f0_min)
95 | 
96 |     timeScale = range(0, len(sig) - w_len, w_step)  # time values for each analysis window
97 |     times = [t/float(sr) for t in timeScale]
98 |     frames = [sig[t:t + w_len] for t in timeScale]
99 | 
100 |     pitches = [0.0] * len(timeScale)
101 |     harmonic_rates = [0.0] * len(timeScale)
102 |     argmins = [0.0] * len(timeScale)
103 | 
104 |     for i, frame in enumerate(frames):
105 |         # Compute YIN
106 |         df = differenceFunction(frame, w_len, tau_max)
107 |         cmdf = cumulativeMeanNormalizedDifferenceFunction(df, tau_max)
108 |         p = getPitch(cmdf, tau_min, tau_max, harmo_thresh)
109 | 
110 |         # Get results
111 |         if np.argmin(cmdf) > tau_min:
112 |             argmins[i] = float(sr / np.argmin(cmdf))
113 |         if p != 0:  # A pitch was found
114 |             pitches[i] = float(sr / p)
115 |             harmonic_rates[i] = cmdf[p]
116 |         else:  # No pitch, but we compute a value of the harmonic rate
117 |             harmonic_rates[i] = min(cmdf)
118 | 
119 |     return pitches, harmonic_rates, argmins, times
120 | 
--------------------------------------------------------------------------------
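yin.py supplies the frame-level f0 estimates from which Mellotron's continuous pitch contours are built. As a quick sanity check, the sketch below (not part of the repository) shows how compute_yin can be combined with load_wav_to_torch from utils.py to extract a pitch track from one of the bundled example files; the peak normalization and the window, hop, f0 range and threshold values are illustrative choices, not values prescribed by the repo.

import numpy as np

from utils import load_wav_to_torch
from yin import compute_yin

# Load one of the example files shipped in data/ and peak-normalize to [-1, 1]
# (assumed preprocessing for this sketch).
audio, sampling_rate = load_wav_to_torch('data/example1.wav')
signal = audio.numpy()
signal = signal / np.max(np.abs(signal))

# Frame-level f0 estimates; unvoiced frames come back as 0 Hz.
pitches, harmonic_rates, argmins, times = compute_yin(
    signal, sampling_rate, w_len=1024, w_step=256,
    f0_min=80, f0_max=880, harmo_thresh=0.25)

# Keep only voiced frames, e.g. for plotting or further processing.
voiced = [(t, f0) for t, f0 in zip(times, pitches) if f0 > 0]
print("voiced frames: {}/{}".format(len(voiced), len(times)))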