├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING
├── LICENSE
├── README.md
├── data.py
├── generate.py
├── img
│   ├── attn_10.png
│   ├── attn_14.png
│   └── method.png
├── model.py
├── notebooks
│   └── generate.ipynb
├── scripts
│   ├── download_data.sh
│   ├── download_models.sh
│   ├── download_tools.sh
│   └── requirements.txt
├── train.py
└── utils.py
/.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | *.pth 3 | *.tar.gz 4 | *.egg-info 5 | models/** 6 | data/** 7 | tools/** 8 | checkpoints/** 9 | notebooks/.ipynb_checkpoints/** 10 | .ipynb_checkpoints/** 11 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Code of Conduct 2 | 3 | Facebook has adopted a Code of Conduct that we expect project participants to adhere to. Please read the [full text](https://code.fb.com/codeofconduct/) so that you can understand what actions will and will not be tolerated. 4 | -------------------------------------------------------------------------------- /CONTRIBUTING: -------------------------------------------------------------------------------- 1 | # Contributing to loop 2 | We want to make contributing to this project as easy and transparent as 3 | possible. 4 | 5 | ## Pull Requests 6 | We actively welcome your pull requests. 7 | 8 | 1. Fork the repo and create your branch from `master`. 9 | 2. If you've added code that should be tested, add tests. 10 | 3. If you've changed APIs, update the documentation. 11 | 4. Ensure the test suite passes. 12 | 5. Make sure your code lints. 13 | 6. If you haven't already, complete the Contributor License Agreement ("CLA"). 14 | 15 | ## Contributor License Agreement ("CLA") 16 | In order to accept your pull request, we need you to submit a CLA. You only need 17 | to do this once to work on any of Facebook's open source projects. 18 | 19 | Complete your CLA here: https://code.facebook.com/cla 20 | 21 | ## Issues 22 | We use GitHub issues to track public bugs. Please ensure your description is 23 | clear and has sufficient instructions to be able to reproduce the issue. 24 | 25 | Facebook has a [bounty program](https://www.facebook.com/whitehat/) for the safe 26 | disclosure of security bugs. In those cases, please go through the process 27 | outlined on that page and do not file a public issue. 28 | 29 | ## Coding Style 30 | * 2 spaces for indentation rather than tabs 31 | * 80 character line length 32 | 33 | ## License 34 | By contributing to loop, you agree that your contributions will be licensed 35 | under the LICENSE file in the root directory of this source tree. 36 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Attribution-NonCommercial 4.0 International 2 | 3 | ======================================================================= 4 | 5 | Creative Commons Corporation ("Creative Commons") is not a law firm and 6 | does not provide legal services or legal advice. Distribution of 7 | Creative Commons public licenses does not create a lawyer-client or 8 | other relationship. Creative Commons makes its licenses and related 9 | information available on an "as-is" basis. Creative Commons gives no 10 | warranties regarding its licenses, any material licensed under their 11 | terms and conditions, or any related information.
Creative Commons 12 | disclaims all liability for damages resulting from their use to the 13 | fullest extent possible. 14 | 15 | Using Creative Commons Public Licenses 16 | 17 | Creative Commons public licenses provide a standard set of terms and 18 | conditions that creators and other rights holders may use to share 19 | original works of authorship and other material subject to copyright 20 | and certain other rights specified in the public license below. The 21 | following considerations are for informational purposes only, are not 22 | exhaustive, and do not form part of our licenses. 23 | 24 | Considerations for licensors: Our public licenses are 25 | intended for use by those authorized to give the public 26 | permission to use material in ways otherwise restricted by 27 | copyright and certain other rights. Our licenses are 28 | irrevocable. Licensors should read and understand the terms 29 | and conditions of the license they choose before applying it. 30 | Licensors should also secure all rights necessary before 31 | applying our licenses so that the public can reuse the 32 | material as expected. Licensors should clearly mark any 33 | material not subject to the license. This includes other CC- 34 | licensed material, or material used under an exception or 35 | limitation to copyright. More considerations for licensors: 36 | wiki.creativecommons.org/Considerations_for_licensors 37 | 38 | Considerations for the public: By using one of our public 39 | licenses, a licensor grants the public permission to use the 40 | licensed material under specified terms and conditions. If 41 | the licensor's permission is not necessary for any reason--for 42 | example, because of any applicable exception or limitation to 43 | copyright--then that use is not regulated by the license. Our 44 | licenses grant only permissions under copyright and certain 45 | other rights that a licensor has authority to grant. Use of 46 | the licensed material may still be restricted for other 47 | reasons, including because others have copyright or other 48 | rights in the material. A licensor may make special requests, 49 | such as asking that all changes be marked or described. 50 | Although not required by our licenses, you are encouraged to 51 | respect those requests where reasonable. More_considerations 52 | for the public: 53 | wiki.creativecommons.org/Considerations_for_licensees 54 | 55 | ======================================================================= 56 | 57 | Creative Commons Attribution-NonCommercial 4.0 International Public 58 | License 59 | 60 | By exercising the Licensed Rights (defined below), You accept and agree 61 | to be bound by the terms and conditions of this Creative Commons 62 | Attribution-NonCommercial 4.0 International Public License ("Public 63 | License"). To the extent this Public License may be interpreted as a 64 | contract, You are granted the Licensed Rights in consideration of Your 65 | acceptance of these terms and conditions, and the Licensor grants You 66 | such rights in consideration of benefits the Licensor receives from 67 | making the Licensed Material available under these terms and 68 | conditions. 69 | 70 | 71 | Section 1 -- Definitions. 72 | 73 | a. 
Adapted Material means material subject to Copyright and Similar 74 | Rights that is derived from or based upon the Licensed Material 75 | and in which the Licensed Material is translated, altered, 76 | arranged, transformed, or otherwise modified in a manner requiring 77 | permission under the Copyright and Similar Rights held by the 78 | Licensor. For purposes of this Public License, where the Licensed 79 | Material is a musical work, performance, or sound recording, 80 | Adapted Material is always produced where the Licensed Material is 81 | synched in timed relation with a moving image. 82 | 83 | b. Adapter's License means the license You apply to Your Copyright 84 | and Similar Rights in Your contributions to Adapted Material in 85 | accordance with the terms and conditions of this Public License. 86 | 87 | c. Copyright and Similar Rights means copyright and/or similar rights 88 | closely related to copyright including, without limitation, 89 | performance, broadcast, sound recording, and Sui Generis Database 90 | Rights, without regard to how the rights are labeled or 91 | categorized. For purposes of this Public License, the rights 92 | specified in Section 2(b)(1)-(2) are not Copyright and Similar 93 | Rights. 94 | d. Effective Technological Measures means those measures that, in the 95 | absence of proper authority, may not be circumvented under laws 96 | fulfilling obligations under Article 11 of the WIPO Copyright 97 | Treaty adopted on December 20, 1996, and/or similar international 98 | agreements. 99 | 100 | e. Exceptions and Limitations means fair use, fair dealing, and/or 101 | any other exception or limitation to Copyright and Similar Rights 102 | that applies to Your use of the Licensed Material. 103 | 104 | f. Licensed Material means the artistic or literary work, database, 105 | or other material to which the Licensor applied this Public 106 | License. 107 | 108 | g. Licensed Rights means the rights granted to You subject to the 109 | terms and conditions of this Public License, which are limited to 110 | all Copyright and Similar Rights that apply to Your use of the 111 | Licensed Material and that the Licensor has authority to license. 112 | 113 | h. Licensor means the individual(s) or entity(ies) granting rights 114 | under this Public License. 115 | 116 | i. NonCommercial means not primarily intended for or directed towards 117 | commercial advantage or monetary compensation. For purposes of 118 | this Public License, the exchange of the Licensed Material for 119 | other material subject to Copyright and Similar Rights by digital 120 | file-sharing or similar means is NonCommercial provided there is 121 | no payment of monetary compensation in connection with the 122 | exchange. 123 | 124 | j. Share means to provide material to the public by any means or 125 | process that requires permission under the Licensed Rights, such 126 | as reproduction, public display, public performance, distribution, 127 | dissemination, communication, or importation, and to make material 128 | available to the public including in ways that members of the 129 | public may access the material from a place and at a time 130 | individually chosen by them. 131 | 132 | k. Sui Generis Database Rights means rights other than copyright 133 | resulting from Directive 96/9/EC of the European Parliament and of 134 | the Council of 11 March 1996 on the legal protection of databases, 135 | as amended and/or succeeded, as well as other essentially 136 | equivalent rights anywhere in the world. 
137 | 138 | l. You means the individual or entity exercising the Licensed Rights 139 | under this Public License. Your has a corresponding meaning. 140 | 141 | 142 | Section 2 -- Scope. 143 | 144 | a. License grant. 145 | 146 | 1. Subject to the terms and conditions of this Public License, 147 | the Licensor hereby grants You a worldwide, royalty-free, 148 | non-sublicensable, non-exclusive, irrevocable license to 149 | exercise the Licensed Rights in the Licensed Material to: 150 | 151 | a. reproduce and Share the Licensed Material, in whole or 152 | in part, for NonCommercial purposes only; and 153 | 154 | b. produce, reproduce, and Share Adapted Material for 155 | NonCommercial purposes only. 156 | 157 | 2. Exceptions and Limitations. For the avoidance of doubt, where 158 | Exceptions and Limitations apply to Your use, this Public 159 | License does not apply, and You do not need to comply with 160 | its terms and conditions. 161 | 162 | 3. Term. The term of this Public License is specified in Section 163 | 6(a). 164 | 165 | 4. Media and formats; technical modifications allowed. The 166 | Licensor authorizes You to exercise the Licensed Rights in 167 | all media and formats whether now known or hereafter created, 168 | and to make technical modifications necessary to do so. The 169 | Licensor waives and/or agrees not to assert any right or 170 | authority to forbid You from making technical modifications 171 | necessary to exercise the Licensed Rights, including 172 | technical modifications necessary to circumvent Effective 173 | Technological Measures. For purposes of this Public License, 174 | simply making modifications authorized by this Section 2(a) 175 | (4) never produces Adapted Material. 176 | 177 | 5. Downstream recipients. 178 | 179 | a. Offer from the Licensor -- Licensed Material. Every 180 | recipient of the Licensed Material automatically 181 | receives an offer from the Licensor to exercise the 182 | Licensed Rights under the terms and conditions of this 183 | Public License. 184 | 185 | b. No downstream restrictions. You may not offer or impose 186 | any additional or different terms or conditions on, or 187 | apply any Effective Technological Measures to, the 188 | Licensed Material if doing so restricts exercise of the 189 | Licensed Rights by any recipient of the Licensed 190 | Material. 191 | 192 | 6. No endorsement. Nothing in this Public License constitutes or 193 | may be construed as permission to assert or imply that You 194 | are, or that Your use of the Licensed Material is, connected 195 | with, or sponsored, endorsed, or granted official status by, 196 | the Licensor or others designated to receive attribution as 197 | provided in Section 3(a)(1)(A)(i). 198 | 199 | b. Other rights. 200 | 201 | 1. Moral rights, such as the right of integrity, are not 202 | licensed under this Public License, nor are publicity, 203 | privacy, and/or other similar personality rights; however, to 204 | the extent possible, the Licensor waives and/or agrees not to 205 | assert any such rights held by the Licensor to the limited 206 | extent necessary to allow You to exercise the Licensed 207 | Rights, but not otherwise. 208 | 209 | 2. Patent and trademark rights are not licensed under this 210 | Public License. 211 | 212 | 3. 
To the extent possible, the Licensor waives any right to 213 | collect royalties from You for the exercise of the Licensed 214 | Rights, whether directly or through a collecting society 215 | under any voluntary or waivable statutory or compulsory 216 | licensing scheme. In all other cases the Licensor expressly 217 | reserves any right to collect such royalties, including when 218 | the Licensed Material is used other than for NonCommercial 219 | purposes. 220 | 221 | 222 | Section 3 -- License Conditions. 223 | 224 | Your exercise of the Licensed Rights is expressly made subject to the 225 | following conditions. 226 | 227 | a. Attribution. 228 | 229 | 1. If You Share the Licensed Material (including in modified 230 | form), You must: 231 | 232 | a. retain the following if it is supplied by the Licensor 233 | with the Licensed Material: 234 | 235 | i. identification of the creator(s) of the Licensed 236 | Material and any others designated to receive 237 | attribution, in any reasonable manner requested by 238 | the Licensor (including by pseudonym if 239 | designated); 240 | 241 | ii. a copyright notice; 242 | 243 | iii. a notice that refers to this Public License; 244 | 245 | iv. a notice that refers to the disclaimer of 246 | warranties; 247 | 248 | v. a URI or hyperlink to the Licensed Material to the 249 | extent reasonably practicable; 250 | 251 | b. indicate if You modified the Licensed Material and 252 | retain an indication of any previous modifications; and 253 | 254 | c. indicate the Licensed Material is licensed under this 255 | Public License, and include the text of, or the URI or 256 | hyperlink to, this Public License. 257 | 258 | 2. You may satisfy the conditions in Section 3(a)(1) in any 259 | reasonable manner based on the medium, means, and context in 260 | which You Share the Licensed Material. For example, it may be 261 | reasonable to satisfy the conditions by providing a URI or 262 | hyperlink to a resource that includes the required 263 | information. 264 | 265 | 3. If requested by the Licensor, You must remove any of the 266 | information required by Section 3(a)(1)(A) to the extent 267 | reasonably practicable. 268 | 269 | 4. If You Share Adapted Material You produce, the Adapter's 270 | License You apply must not prevent recipients of the Adapted 271 | Material from complying with this Public License. 272 | 273 | 274 | Section 4 -- Sui Generis Database Rights. 275 | 276 | Where the Licensed Rights include Sui Generis Database Rights that 277 | apply to Your use of the Licensed Material: 278 | 279 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right 280 | to extract, reuse, reproduce, and Share all or a substantial 281 | portion of the contents of the database for NonCommercial purposes 282 | only; 283 | 284 | b. if You include all or a substantial portion of the database 285 | contents in a database in which You have Sui Generis Database 286 | Rights, then the database in which You have Sui Generis Database 287 | Rights (but not its individual contents) is Adapted Material; and 288 | 289 | c. You must comply with the conditions in Section 3(a) if You Share 290 | all or a substantial portion of the contents of the database. 291 | 292 | For the avoidance of doubt, this Section 4 supplements and does not 293 | replace Your obligations under this Public License where the Licensed 294 | Rights include other Copyright and Similar Rights. 295 | 296 | 297 | Section 5 -- Disclaimer of Warranties and Limitation of Liability. 298 | 299 | a. 
UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 300 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 301 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 302 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 303 | IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, 304 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 305 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 306 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 307 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 308 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 309 | 310 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 311 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 312 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 313 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 314 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 315 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 316 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 317 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 318 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 319 | 320 | c. The disclaimer of warranties and limitation of liability provided 321 | above shall be interpreted in a manner that, to the extent 322 | possible, most closely approximates an absolute disclaimer and 323 | waiver of all liability. 324 | 325 | 326 | Section 6 -- Term and Termination. 327 | 328 | a. This Public License applies for the term of the Copyright and 329 | Similar Rights licensed here. However, if You fail to comply with 330 | this Public License, then Your rights under this Public License 331 | terminate automatically. 332 | 333 | b. Where Your right to use the Licensed Material has terminated under 334 | Section 6(a), it reinstates: 335 | 336 | 1. automatically as of the date the violation is cured, provided 337 | it is cured within 30 days of Your discovery of the 338 | violation; or 339 | 340 | 2. upon express reinstatement by the Licensor. 341 | 342 | For the avoidance of doubt, this Section 6(b) does not affect any 343 | right the Licensor may have to seek remedies for Your violations 344 | of this Public License. 345 | 346 | c. For the avoidance of doubt, the Licensor may also offer the 347 | Licensed Material under separate terms or conditions or stop 348 | distributing the Licensed Material at any time; however, doing so 349 | will not terminate this Public License. 350 | 351 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public 352 | License. 353 | 354 | 355 | Section 7 -- Other Terms and Conditions. 356 | 357 | a. The Licensor shall not be bound by any additional or different 358 | terms or conditions communicated by You unless expressly agreed. 359 | 360 | b. Any arrangements, understandings, or agreements regarding the 361 | Licensed Material not stated herein are separate from and 362 | independent of the terms and conditions of this Public License. 363 | 364 | 365 | Section 8 -- Interpretation. 366 | 367 | a. For the avoidance of doubt, this Public License does not, and 368 | shall not be interpreted to, reduce, limit, restrict, or impose 369 | conditions on any use of the Licensed Material that could lawfully 370 | be made without permission under this Public License. 371 | 372 | b. 
To the extent possible, if any provision of this Public License is 373 | deemed unenforceable, it shall be automatically reformed to the 374 | minimum extent necessary to make it enforceable. If the provision 375 | cannot be reformed, it shall be severed from this Public License 376 | without affecting the enforceability of the remaining terms and 377 | conditions. 378 | 379 | c. No term or condition of this Public License will be waived and no 380 | failure to comply consented to unless expressly agreed to by the 381 | Licensor. 382 | 383 | d. Nothing in this Public License constitutes or may be interpreted 384 | as a limitation upon, or waiver of, any privileges and immunities 385 | that apply to the Licensor or You, including from the legal 386 | processes of any jurisdiction or authority. 387 | 388 | ======================================================================= 389 | 390 | Creative Commons is not a party to its public 391 | licenses. Notwithstanding, Creative Commons may elect to apply one of 392 | its public licenses to material it publishes and in those instances 393 | will be considered the “Licensor.” The text of the Creative Commons 394 | public licenses is dedicated to the public domain under the CC0 Public 395 | Domain Dedication. Except for the limited purpose of indicating that 396 | material is shared under a Creative Commons public license or as 397 | otherwise permitted by the Creative Commons policies published at 398 | creativecommons.org/policies, Creative Commons does not authorize the 399 | use of the trademark "Creative Commons" or any other trademark or logo 400 | of Creative Commons without its prior written consent including, 401 | without limitation, in connection with any unauthorized modifications 402 | to any of its public licenses or any other arrangements, 403 | understandings, or agreements concerning use of licensed material. For 404 | the avoidance of doubt, this paragraph does not form part of the 405 | public licenses. 406 | 407 | Creative Commons may be contacted at creativecommons.org. 408 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # VoiceLoop 2 | PyTorch implementation of the method described in the paper [VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop](https://arxiv.org/abs/1707.06588). 3 | 4 |
![VoiceLoop method diagram](img/method.png)
5 | 6 | VoiceLoop is a neural text-to-speech (TTS) system that transforms text to speech in voices that are sampled 7 | in the wild. Some demo samples can be [found here](https://ytaigman.github.io/loop/site/). 8 | 9 | ## Quick Links 10 | - [Demo Samples](https://ytaigman.github.io/loop/site/) 11 | - [Quick Start](#quick-start) 12 | - [Setup](#setup) 13 | - [Training](#training) 14 | 15 | ## Quick Start 16 | Follow the instructions in [Setup](#setup) and then simply execute: 17 | ```bash 18 | python generate.py --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 13 --checkpoint models/vctk/bestmodel.pth 19 | ``` 20 | Results will be placed in ```models/vctk/results```. This generates two samples: 21 | * The [generated sample](https://ytaigman.github.io/loop/demos/vctk_tutorial/p318_212.gen_10.wav) is saved with a `.gen_<speaker id>.wav` suffix. 22 | * Its [ground-truth (test) sample](https://ytaigman.github.io/loop/demos/vctk_tutorial/p318_212.orig.wav) is also generated and saved with an `.orig.wav` suffix. 23 | 24 | You can also generate the same text with a different speaker, for example: 25 | ```bash 26 | python generate.py --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 18 --checkpoint models/vctk/bestmodel.pth 27 | ``` 28 | This will generate the following [sample](https://ytaigman.github.io/loop/demos/vctk_tutorial/p318_212.gen_14.wav). 29 | 30 | Here is the corresponding attention plot: 31 | 32 |
![Attention plot, speaker 10](img/attn_10.png) ![Attention plot, speaker 14](img/attn_14.png)
33 | 34 | Legend: the x-axis is output time (acoustic samples); the y-axis is the input (text/phonemes). The left figure is speaker 10, the right is speaker 14. 35 | 36 | Finally, free text is also supported: 37 | ```bash 38 | python generate.py --text "hello world" --spkr 1 --checkpoint models/vctk/bestmodel.pth 39 | ``` 40 | 41 | ## Setup 42 | Requirements: Linux/OSX, Python 2.7 and [PyTorch 0.1.12](http://pytorch.org/). Generation requires installing [phonemizer](https://github.com/bootphon/phonemizer); follow the setup instructions there. 43 | The current version of the code requires CUDA support for training. Generation can be done on the CPU. 44 | 45 | ```bash 46 | git clone https://github.com/facebookresearch/loop.git 47 | cd loop 48 | pip install -r scripts/requirements.txt 49 | ``` 50 | 51 | ### Data 52 | The data used to train the models in the paper can be downloaded via: 53 | ```bash 54 | bash scripts/download_data.sh 55 | ``` 56 | 57 | The script downloads and preprocesses a subset of [VCTK](http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html). This subset contains speakers with an American accent. 58 | 59 | The dataset was preprocessed using [Merlin](http://www.cstr.ed.ac.uk/projects/merlin/): from each audio clip we extracted vocoder features using the [WORLD](http://ml.cs.yamanashi.ac.jp/world/english/) vocoder. After downloading, the dataset will be located under the subfolder ```data``` as follows: 60 | 61 | ``` 62 | loop 63 | ├── data 64 | │   └── vctk 65 | │       ├── norm_info 66 | │       │   └── norm.dat 67 | │       ├── numpy_features 68 | │       │   ├── p294_001.npz 69 | │       │   ├── p294_002.npz 70 | │       │   └── ... 71 | │       └── numpy_features_valid 72 | ``` 73 | 74 | The preprocessing pipeline can be executed using the following script by Kyle Kastner: https://gist.github.com/kastnerkyle/cc0ac48d34860c5bb3f9112f4d9a0300. 75 | 76 | ### Pretrained Models 77 | Pretrained models can be downloaded via: 78 | ```bash 79 | bash scripts/download_models.sh 80 | ``` 81 | After downloading, the models will be located under the subfolder ```models``` as follows: 82 | 83 | ``` 84 | loop 85 | ├── data 86 | ├── models 87 | │   ├── blizzard 88 | │   ├── vctk 89 | │   │   ├── args.pth 90 | │   │   └── bestmodel.pth 91 | │   └── vctk_alt 92 | ``` 93 | 94 | **Update 10/25/2017:** Single-speaker model available in models/blizzard/ 95 | 96 | ### SPTK and WORLD 97 | Finally, speech generation requires [SPTK 3.9](http://sp-tk.sourceforge.net/) and the [WORLD](http://ml.cs.yamanashi.ac.jp/world/english/) vocoder, as in Merlin. To download the executables: 98 | ```bash 99 | bash scripts/download_tools.sh 100 | ``` 101 | This results in the following subdirectories: 102 | ``` 103 | loop 104 | ├── data 105 | ├── models 106 | ├── tools 107 | │   ├── SPTK-3.9 108 | │   └── WORLD 109 | ``` 110 | 111 | ## Training 112 | 113 | ### Single-Speaker 114 | The single-speaker model is trained on [Blizzard 2011](http://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/). Data should be downloaded and prepared as described above.
Once the data is ready, run: 115 | ```bash 116 | python train.py --noise 1 --expName blizzard_init --seq-len 1600 --max-seq-len 1600 --data data/blizzard --nspk 1 --lr 1e-5 --epochs 10 117 | ``` 118 | Then, continue training the model with: 119 | ```bash 120 | python train.py --noise 1 --expName blizzard --seq-len 1600 --max-seq-len 1600 --data data/blizzard --nspk 1 --lr 1e-4 --checkpoint checkpoints/blizzard_init/bestmodel.pth --epochs 90 121 | ``` 122 | ### Multi-Speaker 123 | To train a new model on VCTK, first train the model using a noise level of 4 and an input sequence length of 100: 124 | ```bash 125 | python train.py --expName vctk --data data/vctk --noise 4 --seq-len 100 --epochs 90 126 | ``` 127 | Then, continue training the model using a noise level of 2, on full sequences: 128 | ```bash 129 | python train.py --expName vctk_noise_2 --data data/vctk --checkpoint checkpoints/vctk/bestmodel.pth --noise 2 --seq-len 1000 --epochs 90 130 | ``` 131 | 132 | ## Citation 133 | If you find this code useful in your research then please cite: 134 | 135 | ``` 136 | @article{taigman2017voice, 137 | title = {VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop}, 138 | author = {Taigman, Yaniv and Wolf, Lior and Polyak, Adam and Nachmani, Eliya}, 139 | journal = {ArXiv e-prints}, 140 | archivePrefix = "arXiv", 141 | eprinttype = {arxiv}, 142 | eprint = {1707.06588}, 143 | primaryClass = "cs.CL", 144 | year = {2017}, 145 | month = {October}, 146 | } 147 | ``` 148 | 149 | ## License 150 | Loop has a CC BY-NC 4.0 license. 151 | -------------------------------------------------------------------------------- /data.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree.
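# --- Editor's note: a minimal usage sketch of this module, mirroring the
# training loop in train.py. NpzFolder indexes *.npz utterances, NpzLoader
# batches them through collate_by_input_length, and TBPTTIter slices each
# padded target batch into seq_len segments for truncated backpropagation
# through time. The paths below are assumptions taken from the README:
#
#     from data import NpzFolder, NpzLoader, TBPTTIter
#     dataset = NpzFolder('data/vctk/numpy_features')
#     loader = NpzLoader(dataset, max_seq_len=1000, batch_size=64)
#     for txt, feat, spkr in loader:              # one padded batch
#         for src, tgt, spkrs, start in TBPTTIter(txt, feat, spkr, 100):
#             pass                                # one TBPTT segment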
6 | 7 | from functools import partial 8 | from collections import defaultdict 9 | import numpy as np 10 | import os 11 | 12 | import torch 13 | import torch.utils.data as data 14 | 15 | 16 | # Taken from 17 | # https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/Dataset.py 18 | def batchify(data): 19 | out, lengths = None, None 20 | 21 | lengths = [x.size(0) for x in data] 22 | max_length = max(lengths) 23 | 24 | if data[0].dim() == 1: 25 | out = data[0].new(len(data), max_length).fill_(0) 26 | for i in range(len(data)): 27 | data_length = data[i].size(0) 28 | out[i].narrow(0, 0, data_length).copy_(data[i]) 29 | else: 30 | feat_size = data[0].size(1) 31 | out = data[0].new(len(data), max_length, feat_size).fill_(0) 32 | for i in range(len(data)): 33 | data_length = data[i].size(0) 34 | out[i].narrow(0, 0, data_length).copy_(data[i]) 35 | 36 | return out, lengths 37 | 38 | 39 | def collate_by_input_length(batch, max_seq_len): 40 | "Puts each data field into a tensor with outer dimension batch size" 41 | if torch.is_tensor(batch[0]): 42 | return batchify(batch) 43 | elif isinstance(batch[0], int): 44 | return torch.LongTensor(batch) 45 | else: 46 | new_batch = [x for x in batch if x[1].size(0) < max_seq_len] 47 | if len(new_batch) == 0: 48 | return (None, None), (None, None), None 49 | 50 | batch = new_batch 51 | transposed = zip(*batch) 52 | (srcBatch, srcLengths), (tgtBatch, tgtLengths), speakers = \ 53 | [collate_by_input_length(samples, max_seq_len) 54 | for samples in transposed] 55 | 56 | # within batch sorting by decreasing length for variable length rnns 57 | batch = zip(srcBatch, tgtBatch, tgtLengths, speakers) 58 | batch, srcLengths = zip(*sorted(zip(batch, srcLengths), 59 | key=lambda x: -x[1])) 60 | srcBatch, tgtBatch, tgtLengths, speakers = zip(*batch) 61 | 62 | srcBatch = torch.stack(srcBatch, 0).transpose(0, 1).contiguous() 63 | tgtBatch = torch.stack(tgtBatch, 0).transpose(0, 1).contiguous() 64 | srcLengths = torch.LongTensor(srcLengths) 65 | tgtLengths = torch.LongTensor(tgtLengths) 66 | speakers = torch.LongTensor(speakers).view(-1, 1) 67 | 68 | return (srcBatch, srcLengths), (tgtBatch, tgtLengths), speakers 69 | 70 | raise TypeError(("batch must contain tensors, numbers, dicts or \ 71 | lists; found {}".format(type(batch[0])))) 72 | 73 | 74 | class NpzFolder(data.Dataset): 75 | NPZ_EXTENSION = 'npz' 76 | 77 | def __init__(self, root, single_spkr=False): 78 | self.root = root 79 | self.npzs = self.make_dataset(self.root) 80 | 81 | if len(self.npzs) == 0: 82 | raise(RuntimeError("Found 0 npz in subfolders of: " + root + "\n" 83 | "Supported file extensions are: " + 84 | self.NPZ_EXTENSION)) 85 | 86 | if single_spkr: 87 | self.speakers = defaultdict(lambda: 0) 88 | else: 89 | self.speakers = [] 90 | for fname in self.npzs: 91 | self.speakers += [os.path.basename(fname).split('_')[0]] 92 | self.speakers = list(set(self.speakers)) 93 | self.speakers.sort() 94 | self.speakers = {v: i for i, v in enumerate(self.speakers)} 95 | 96 | code2phone = np.load(self.npzs[0])['code2phone'] 97 | self.dict = {v: k for k, v in enumerate(code2phone)} 98 | 99 | def __getitem__(self, index): 100 | path = self.npzs[index] 101 | txt, feat, spkr = self.loader(path) 102 | 103 | return txt, feat, self.speakers[spkr] 104 | 105 | def __len__(self): 106 | return len(self.npzs) 107 | 108 | def make_dataset(self, dir): 109 | files = [] 110 | 111 | for root, _, fnames in sorted(os.walk(dir)): 112 | for fname in fnames: 113 | if self.NPZ_EXTENSION in fname: 114 | path = os.path.join(root, fname) 115 |
files.append(path) 116 | 117 | return files 118 | 119 | def loader(self, path): 120 | feat = np.load(path) 121 | 122 | txt = feat['phonemes'].astype('int64') 123 | txt = torch.from_numpy(txt) 124 | 125 | audio = feat['audio_features'] 126 | audio = torch.from_numpy(audio) 127 | 128 | spkr = os.path.basename(path).split('_')[0] 129 | 130 | return txt, audio, spkr 131 | 132 | 133 | class NpzLoader(data.DataLoader): 134 | def __init__(self, *args, **kwargs): 135 | kwargs['collate_fn'] = partial(collate_by_input_length, 136 | max_seq_len=kwargs['max_seq_len']) 137 | del kwargs['max_seq_len'] 138 | 139 | data.DataLoader.__init__(self, *args, **kwargs) 140 | 141 | 142 | class TBPTTIter(object): 143 | """ 144 | Iterator for truncated backpropagation through time (TBPTT) training. 145 | The target sequence is segmented while the input sequence remains the same. 146 | """ 147 | def __init__(self, src, trgt, spkr, seq_len): 148 | self.seq_len = seq_len 149 | self.start = True 150 | 151 | self.speakers = spkr 152 | self.srcBatch = src[0] 153 | self.srcLengths = src[1] 154 | 155 | # split batch 156 | self.tgtBatch = list(torch.split(trgt[0], self.seq_len, 0)) 157 | self.tgtBatch.reverse() 158 | self.len = len(self.tgtBatch) 159 | 160 | # split length list 161 | batch_seq_len = len(self.tgtBatch) 162 | self.tgtLengths = [self.split_length(l, batch_seq_len) for l in trgt[1]] 163 | self.tgtLengths = torch.stack(self.tgtLengths) 164 | self.tgtLengths = list(torch.split(self.tgtLengths, 1, 1)) 165 | self.tgtLengths = [x.squeeze() for x in self.tgtLengths] 166 | self.tgtLengths.reverse() 167 | 168 | assert len(self.tgtLengths) == len(self.tgtBatch) 169 | 170 | def split_length(self, seq_size, batch_seq_len): 171 | seq = [self.seq_len] * (seq_size // self.seq_len) 172 | if seq_size % self.seq_len != 0: 173 | seq += [seq_size % self.seq_len] 174 | seq += [0] * (batch_seq_len - len(seq)) 175 | return torch.LongTensor(seq) 176 | 177 | def __next__(self): 178 | if len(self.tgtBatch) == 0: 179 | raise StopIteration() 180 | 181 | if self.len > len(self.tgtBatch): 182 | self.start = False 183 | 184 | return (self.srcBatch, self.srcLengths), \ 185 | (self.tgtBatch.pop(), self.tgtLengths.pop()), \ 186 | self.speakers, self.start 187 | 188 | next = __next__ 189 | 190 | def __iter__(self): 191 | return self 192 | 193 | def __len__(self): 194 | return self.len 195 | -------------------------------------------------------------------------------- /generate.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree.
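# Example invocations (taken from the README):
#   python generate.py --npz data/vctk/numpy_features_valid/p318_212.npz \
#       --spkr 13 --checkpoint models/vctk/bestmodel.pth
#   python generate.py --text "hello world" --spkr 1 \
#       --checkpoint models/vctk/bestmodel.pth
# Generated .wav files are written to a 'results' folder next to the
# checkpoint (see main() below).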
6 | 7 | import os 8 | import argparse 9 | import numpy as np 10 | import phonemizer 11 | import string 12 | 13 | import torch 14 | from torch.autograd import Variable 15 | 16 | from model import Loop 17 | from data import NpzFolder 18 | from utils import generate_merlin_wav 19 | 20 | 21 | parser = argparse.ArgumentParser(description='PyTorch Phonological Loop \ 22 | Generation') 23 | parser.add_argument('--npz', type=str, default='', 24 | help='Dataset sample to generate.') 25 | parser.add_argument('--text', default='', 26 | type=str, help='Free text to generate.') 27 | parser.add_argument('--spkr', default=0, 28 | type=int, help='Speaker id.') 29 | parser.add_argument('--checkpoint', default='checkpoints/vctk/lastmodel.pth', 30 | type=str, help='Model used for generation.') 31 | parser.add_argument('--gpu', default=-1, 32 | type=int, help='GPU device ID, use -1 for CPU.') 33 | 34 | 35 | # init 36 | args = parser.parse_args() 37 | if args.gpu >= 0: 38 | torch.cuda.set_device(args.gpu) 39 | 40 | 41 | def text2phone(text, char2code): 42 | separator = phonemizer.separator.Separator('', '', ' ') 43 | ph = phonemizer.phonemize(text, separator=separator) 44 | ph = ph.split(' ') 45 | ph.remove('') 46 | 47 | result = [char2code[p] for p in ph] 48 | return torch.LongTensor(result) 49 | 50 | 51 | def trim_pred(out, attn): 52 | tq = attn.abs().sum(1).data 53 | 54 | for stopi in range(1, tq.size(0)): 55 | col_sum = attn[:stopi, :].abs().sum(0).data.squeeze() 56 | if tq[stopi][0] < 0.5 and col_sum[-1] > 4: 57 | break 58 | 59 | out = out[:stopi, :] 60 | attn = attn[:stopi, :] 61 | 62 | return out, attn 63 | 64 | 65 | def npy_loader_phonemes(path): 66 | feat = np.load(path) 67 | 68 | txt = feat['phonemes'].astype('int64') 69 | txt = torch.from_numpy(txt) 70 | 71 | audio = feat['audio_features'] 72 | audio = torch.from_numpy(audio) 73 | 74 | return txt, audio 75 | 76 | 77 | def main(): 78 | weights = torch.load(args.checkpoint, 79 | map_location=lambda storage, loc: storage) 80 | opt = torch.load(os.path.dirname(args.checkpoint) + '/args.pth') 81 | train_args = opt[0] 82 | 83 | char2code = {'aa': 0, 'ae': 1, 'ah': 2, 'ao': 3, 'aw': 4, 'ax': 5, 'ay': 6, 84 | 'b': 7, 'ch': 8, 'd': 9, 'dh': 10, 'eh': 11, 'er': 12, 'ey': 13, 85 | 'f': 14, 'g': 15, 'hh': 16, 'i': 17, 'ih': 18, 'iy': 19, 'jh': 20, 86 | 'k': 21, 'l': 22, 'm': 23, 'n': 24, 'ng': 25, 'ow': 26, 'oy': 27, 87 | 'p': 28, 'pau': 29, 'r': 30, 's': 31, 'sh': 32, 'ssil': 33, 88 | 't': 34, 'th': 35, 'uh': 36, 'uw': 37, 'v': 38, 'w': 39, 'y': 40, 89 | 'z': 41} 90 | nspkr = train_args.nspk 91 | 92 | norm_path = None 93 | if os.path.exists(train_args.data + '/norm_info/norm.dat'): 94 | norm_path = train_args.data + '/norm_info/norm.dat' 95 | elif os.path.exists(os.path.dirname(args.checkpoint) + '/norm.dat'): 96 | norm_path = os.path.dirname(args.checkpoint) + '/norm.dat' 97 | else: 98 | print('ERROR: Failed to find norm file.') 99 | return 100 | train_args.noise = 0 101 | 102 | model = Loop(train_args) 103 | model.load_state_dict(weights) 104 | if args.gpu >= 0: 105 | model.cuda() 106 | model.eval() 107 | 108 | if args.spkr not in range(nspkr): 109 | print('ERROR: Unknown speaker id: %d.'
% args.spkr) 110 | return 111 | 112 | txt, feat, spkr, output_fname = None, None, None, None 113 | if args.npz != '': 114 | txt, feat = npy_loader_phonemes(args.npz) 115 | 116 | txt = Variable(txt.unsqueeze(1), volatile=True) 117 | feat = Variable(feat.unsqueeze(1), volatile=True) 118 | spkr = Variable(torch.LongTensor([args.spkr]), volatile=True) 119 | 120 | fname = os.path.basename(args.npz)[:-4] 121 | output_fname = fname + '.gen_' + str(args.spkr) 122 | elif args.text != '': 123 | txt = text2phone(args.text, char2code) 124 | feat = torch.FloatTensor(txt.size(0)*20, 63) 125 | spkr = torch.LongTensor([args.spkr]) 126 | 127 | txt = Variable(txt.unsqueeze(1), volatile=True) 128 | feat = Variable(feat.unsqueeze(1), volatile=True) 129 | spkr = Variable(spkr, volatile=True) 130 | 131 | # slugify input string to file name 132 | fname = args.text.replace(' ', '_') 133 | valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits) 134 | fname = ''.join(c for c in fname if c in valid_chars) 135 | 136 | output_fname = fname + '.gen_' + str(args.spkr) 137 | else: 138 | print('ERROR: Must supply npz file path or text as source.') 139 | return 140 | 141 | if args.gpu >= 0: 142 | txt = txt.cuda() 143 | feat = feat.cuda() 144 | spkr = spkr.cuda() 145 | 146 | 147 | out, attn = model([txt, spkr], feat) 148 | out, attn = trim_pred(out, attn) 149 | 150 | output_dir = os.path.join(os.path.dirname(args.checkpoint), 'results') 151 | if not os.path.exists(output_dir): 152 | os.makedirs(output_dir) 153 | 154 | generate_merlin_wav(out.data.cpu().numpy(), 155 | output_dir, 156 | output_fname, 157 | norm_path) 158 | 159 | if args.npz != '': 160 | output_orig_fname = os.path.basename(args.npz)[:-4] + '.orig' 161 | generate_merlin_wav(feat[:, 0, :].data.cpu().numpy(), 162 | output_dir, 163 | output_orig_fname, 164 | norm_path) 165 | 166 | 167 | if __name__ == '__main__': 168 | main() 169 | -------------------------------------------------------------------------------- /img/attn_10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookarchive/loop/112975599b1838a33f139a6c6df0fcd9953ee33f/img/attn_10.png -------------------------------------------------------------------------------- /img/attn_14.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookarchive/loop/112975599b1838a33f139a6c6df0fcd9953ee33f/img/attn_14.png -------------------------------------------------------------------------------- /img/method.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookarchive/loop/112975599b1838a33f139a6c6df0fcd9953ee33f/img/method.png -------------------------------------------------------------------------------- /model.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree.
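# Editor's note on the attention used below: GravesAttention implements a
# monotonic GMM attention in the style of Graves' handwriting-synthesis
# model. Per output step, N_a predicts (g, b, k) for each of K components,
# and forward() computes (j indexes input positions):
#   g_t   = softmax(g) + eps                # mixture weights
#   sig_t = exp(b) + eps                    # precision-like terms
#   mu_t  = mu_{t-1} + alignment * exp(k)   # monotonically advancing means
#   alpha_t(j) = COEF * sum_k g_t[k] * exp(-0.5 * sig_t[k] * (mu_t[k] - j)^2)
# with COEF = 1/sqrt(2*pi); the context is then c_t = alpha_t @ context.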
6 | 7 | import torch 8 | import torch.nn as nn 9 | from torch.autograd import Variable 10 | from torch.nn.utils.rnn import pad_packed_sequence as unpack 11 | from torch.nn.utils.rnn import pack_padded_sequence as pack 12 | 13 | 14 | def getLinear(dim_in, dim_out): 15 | return nn.Sequential(nn.Linear(dim_in, dim_in/10), 16 | nn.ReLU(), 17 | nn.Linear(dim_in/10, dim_out)) 18 | 19 | 20 | class MaskedMSE(nn.Module): 21 | def __init__(self): 22 | super(MaskedMSE, self).__init__() 23 | self.criterion = nn.MSELoss(size_average=False) 24 | 25 | # Taken from 26 | # https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation 27 | @staticmethod 28 | def _sequence_mask(sequence_length, max_len): 29 | batch_size = sequence_length.size(0) 30 | seq_range = torch.arange(0, max_len).long() 31 | seq_range_expand = seq_range.unsqueeze(0).expand(batch_size, max_len) 32 | seq_range_expand = Variable(seq_range_expand) 33 | if sequence_length.is_cuda: 34 | seq_range_expand = seq_range_expand.cuda() 35 | seq_length_expand = sequence_length.unsqueeze(1) \ 36 | .expand_as(seq_range_expand) 37 | return (seq_range_expand < seq_length_expand).t().float() 38 | 39 | def forward(self, input, target, lengths): 40 | max_len = input.size(0) 41 | mask = self._sequence_mask(lengths, max_len).unsqueeze(2) 42 | mask_ = mask.expand_as(input) 43 | self.loss = self.criterion(input*mask_, target*mask_) 44 | self.loss = self.loss / mask.sum() 45 | return self.loss 46 | 47 | 48 | class Encoder(nn.Module): 49 | def __init__(self, opt): 50 | super(Encoder, self).__init__() 51 | self.hidden_size = opt.hidden_size 52 | self.vocabulary_size = opt.vocabulary_size 53 | self.nspk = opt.nspk 54 | self.lut_p = nn.Embedding(self.vocabulary_size, 55 | self.hidden_size, 56 | max_norm=1.0) 57 | self.lut_s = nn.Embedding(self.nspk, 58 | self.hidden_size, 59 | max_norm=1.0) 60 | 61 | def forward(self, input, speakers): 62 | if isinstance(input, tuple): 63 | lengths = input[1].data.view(-1).tolist() 64 | outputs = pack(self.lut_p(input[0]), lengths) 65 | else: 66 | outputs = self.lut_p(input) 67 | if isinstance(input, tuple): 68 | outputs = unpack(outputs)[0] 69 | 70 | ident = self.lut_s(speakers) 71 | if ident.dim() == 3: 72 | ident = ident.squeeze(1) 73 | 74 | return outputs, ident 75 | 76 | 77 | class GravesAttention(nn.Module): 78 | COEF = 0.3989422917366028 # numpy.sqrt(1/(2*numpy.pi)) 79 | 80 | def __init__(self, batch_size, mem_elem, K, attention_alignment): 81 | super(GravesAttention, self).__init__() 82 | self.K = K 83 | self.attention_alignment = attention_alignment 84 | self.epsilon = 1e-5 85 | 86 | self.sm = nn.Softmax() 87 | self.N_a = getLinear(mem_elem, 3*K) 88 | self.J = Variable(torch.arange(0, 500) 89 | .expand_as(torch.Tensor(batch_size, 90 | self.K, 91 | 500)), 92 | requires_grad=False) 93 | 94 | def forward(self, C, context, mu_tm1): 95 | gbk_t = self.N_a(C.view(C.size(0), C.size(1) * C.size(2))) 96 | gbk_t = gbk_t.view(gbk_t.size(0), -1, self.K) 97 | 98 | # attention model parameters 99 | g_t = gbk_t[:, 0, :] 100 | b_t = gbk_t[:, 1, :] 101 | k_t = gbk_t[:, 2, :] 102 | 103 | # attention GMM parameters 104 | g_t = self.sm(g_t) + self.epsilon 105 | sig_t = torch.exp(b_t) + self.epsilon 106 | mu_t = mu_tm1 + self.attention_alignment * torch.exp(k_t) 107 | 108 | g_t = g_t.unsqueeze(2).expand(g_t.size(0), 109 | g_t.size(1), 110 | context.size(1)) 111 | sig_t = sig_t.unsqueeze(2).expand_as(g_t) 112 | mu_t_ = mu_t.unsqueeze(2).expand_as(g_t) 113 | j = self.J[:g_t.size(0), :, :context.size(1)] 114 | 115 | # attention 
weights 116 | phi_t = g_t * torch.exp(-0.5 * sig_t * (mu_t_ - j)**2) 117 | alpha_t = self.COEF * torch.sum(phi_t, 1) 118 | 119 | c_t = torch.bmm(alpha_t, context).transpose(0, 1).squeeze(0) 120 | return c_t, mu_t, alpha_t 121 | 122 | 123 | class Decoder(nn.Module): 124 | def __init__(self, opt): 125 | super(Decoder, self).__init__() 126 | self.K = opt.K 127 | self.hidden_size = opt.hidden_size 128 | self.output_size = opt.output_size 129 | 130 | self.mem_size = opt.mem_size 131 | self.mem_feat_size = opt.output_size + opt.hidden_size 132 | self.mem_elem = self.mem_size * self.mem_feat_size 133 | 134 | self.attn = GravesAttention(opt.batch_size, 135 | self.mem_elem, 136 | self.K, 137 | opt.attention_alignment) 138 | 139 | self.N_o = getLinear(self.mem_elem, self.hidden_size) 140 | self.output = nn.Linear(self.hidden_size, self.output_size) 141 | self.N_u = getLinear(self.mem_elem, self.mem_feat_size) 142 | 143 | self.F_u = nn.Linear(self.hidden_size, self.hidden_size) 144 | self.F_o = nn.Linear(self.hidden_size, self.hidden_size) 145 | 146 | def init_buffer(self, ident, start=True): 147 | mem_feat_size = self.hidden_size + self.output_size 148 | batch_size = ident.size(0) 149 | 150 | if start: 151 | self.mu_t = Variable(ident.data.new(batch_size, self.K).zero_()) 152 | self.S_t = Variable(ident.data.new(batch_size, 153 | mem_feat_size, 154 | self.mem_size).zero_()) 155 | 156 | # initialize with identity 157 | self.S_t[:, :self.hidden_size, :] = ident.unsqueeze(2) \ 158 | .expand(ident.size(0), 159 | ident.size(1), 160 | self.mem_size) 161 | else: 162 | self.mu_t = self.mu_t.detach() 163 | self.S_t = self.S_t.detach() 164 | 165 | def update_buffer(self, S_tm1, c_t, o_tm1, ident): 166 | # concat previous output & context 167 | idt = torch.tanh(self.F_u(ident)) 168 | o_tm1 = o_tm1.squeeze(0) 169 | z_t = torch.cat([c_t + idt, o_tm1/30], 1) 170 | z_t = z_t.unsqueeze(2) 171 | Sp = torch.cat([z_t, S_tm1[:, :, :-1]], 2) 172 | 173 | # update S 174 | u = self.N_u(Sp.view(Sp.size(0), -1)) 175 | u[:, :idt.size(1)] = u[:, :idt.size(1)] + idt 176 | u = u.unsqueeze(2) 177 | S = torch.cat([u, S_tm1[:, :, :-1]], 2) 178 | 179 | return S 180 | 181 | def forward(self, x, ident, context, start=True): 182 | out, attns = [], [] 183 | o_t = x[0] 184 | self.init_buffer(ident, start) 185 | 186 | for o_tm1 in torch.split(x, 1): 187 | if not self.training: 188 | o_tm1 = o_t.unsqueeze(0) 189 | 190 | # predict weighted context based on S 191 | c_t, mu_t, alpha_t = self.attn(self.S_t, 192 | context.transpose(0, 1), 193 | self.mu_t) 194 | 195 | # advance mu and update buffer 196 | self.S_t = self.update_buffer(self.S_t, c_t, o_tm1, ident) 197 | self.mu_t = mu_t 198 | 199 | # predict next time step based on buffer content 200 | ot_out = self.N_o(self.S_t.view(self.S_t.size(0), -1)) 201 | sp_out = self.F_o(ident) 202 | o_t = self.output(ot_out + sp_out) 203 | 204 | out += [o_t] 205 | attns += [alpha_t.squeeze()] 206 | 207 | out_seq = torch.stack(out) 208 | attns_seq = torch.stack(attns) 209 | 210 | return out_seq, attns_seq 211 | 212 | 213 | class Loop(nn.Module): 214 | def __init__(self, opt): 215 | super(Loop, self).__init__() 216 | self.encoder = Encoder(opt) 217 | self.decoder = Decoder(opt) 218 | self.noise = opt.noise 219 | self.output_size = opt.output_size 220 | 221 | def init_input(self, tgt, start): 222 | if start: 223 | self.x_tm1 = torch.zeros(1, tgt.size(1), tgt.size(2)).type_as(tgt.data) 224 | 225 | if tgt.size(0) > 1: 226 | inp = torch.cat([self.x_tm1, tgt[:-1].data]) 227 | else: 228 | inp = self.x_tm1 229 | 
230 | if self.noise > 0: 231 | noise = tgt.data.new(inp.size()).normal_(0, self.noise) 232 | inp += noise 233 | 234 | if not self.training: 235 | inp.zero_() 236 | 237 | self.x_tm1 = tgt[-1].data.unsqueeze(0) 238 | return Variable(inp) 239 | 240 | def cuda(self, device_id=None): 241 | nn.Module.cuda(self, device_id) 242 | self.decoder.attn.J = self.decoder.attn.J.cuda(device_id) 243 | 244 | def forward(self, src, tgt, start=True): 245 | x = self.init_input(tgt, start) 246 | 247 | context, ident = self.encoder(src[0], src[1]) 248 | out, attn = self.decoder(x, ident, context, start) 249 | 250 | return out, attn 251 | -------------------------------------------------------------------------------- /notebooks/generate.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import sys\n", 12 | "sys.path.append('..')\n", 13 | "import os\n", 14 | "import shutil\n", 15 | "import torch\n", 16 | "\n", 17 | "import matplotlib\n", 18 | "import matplotlib.cm as cm\n", 19 | "import matplotlib.pyplot as plt\n", 20 | "\n", 21 | "import numpy\n", 22 | "%matplotlib notebook\n", 23 | "\n", 24 | "from data import *\n", 25 | "from model import Loop\n", 26 | "from utils import generate_merlin_wav\n", 27 | "\n", 28 | "from torch.autograd import Variable\n", 29 | "from IPython.display import Audio\n", 30 | "\n", 31 | "def plot(data, labels, dict_file):\n", 32 | " labels_dict = dict_file\n", 33 | " labels_dict = {v: k for k, v in labels_dict.iteritems()}\n", 34 | " labels = [labels_dict[x].decode('latin-1') for x in labels]\n", 35 | "\n", 36 | " axarr = plt.subplot()\n", 37 | " axarr.imshow(data.T, aspect='auto', origin='lower', interpolation='nearest', cmap=cm.viridis)\n", 38 | " axarr.set_yticks(numpy.arange(0, len(data.T)))\n", 39 | " axarr.set_yticklabels(labels, rotation=90)" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": { 46 | "collapsed": true 47 | }, 48 | "outputs": [], 49 | "source": [ 50 | "data_path = os.path.abspath('../data/vctk/numpy_features')\n", 51 | "norm_info = os.path.abspath('../data/vctk/norm_info/norm.dat')\n", 52 | " \n", 53 | "train_dataset = NpzFolder(data_path)\n", 54 | "valid_dataset = NpzFolder(data_path + '_valid')" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "torch.cuda.set_device(1)\n", 64 | "\n", 65 | "checkpoint = '../models/vctk'\n", 66 | "weights = torch.load(checkpoint + '/bestmodel.pth')\n", 67 | "\n", 68 | "args = torch.load(checkpoint + '/args.pth')\n", 69 | "opt = args[0]\n", 70 | "opt.noise = 0\n", 71 | "\n", 72 | "model = Loop(opt)\n", 73 | "model.load_state_dict(weights)\n", 74 | "model.cuda();\n", 75 | "model.eval();" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": { 82 | "scrolled": false 83 | }, 84 | "outputs": [], 85 | "source": [ 86 | "ID = 5\n", 87 | "txt, feat, _ = valid_dataset[8]\n", 88 | "\n", 89 | "txt = Variable(txt.unsqueeze(1), volatile=True).cuda()\n", 90 | "feat = Variable(feat.unsqueeze(1), volatile=True).cuda()\n", 91 | "spkr = Variable(torch.LongTensor([ID]), volatile=True).cuda()\n", 92 | "\n", 93 | "out, attn = model([txt, spkr], feat)\n", 94 | "\n", 95 | "generate_merlin_wav(out.data.cpu().numpy(), \"/tmp/gen\", file_basename='test',\n", 96 | " 
norm_info_file=norm_info, do_post_filtering=True)\n", 97 | "Audio('/tmp/gen/test.wav')" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "plot(attn.squeeze().data.cpu().numpy(), txt[:,0].squeeze().data.tolist(), valid_dataset.dict)" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "generate_merlin_wav(feat.data.cpu().numpy(), \"/tmp/gen\", file_basename='test',\n", 116 | " norm_info_file=norm_info, do_post_filtering=True)\n", 117 | "Audio('/tmp/gen/test.wav')" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "metadata": { 124 | "collapsed": true 125 | }, 126 | "outputs": [], 127 | "source": [] 128 | } 129 | ], 130 | "metadata": { 131 | "kernelspec": { 132 | "display_name": "Python 2", 133 | "language": "python", 134 | "name": "python2" 135 | }, 136 | "language_info": { 137 | "codemirror_mode": { 138 | "name": "ipython", 139 | "version": 2 140 | }, 141 | "file_extension": ".py", 142 | "mimetype": "text/x-python", 143 | "name": "python", 144 | "nbconvert_exporter": "python", 145 | "pygments_lexer": "ipython2", 146 | "version": "2.7.13" 147 | } 148 | }, 149 | "nbformat": 4, 150 | "nbformat_minor": 2 151 | } 152 | -------------------------------------------------------------------------------- /scripts/download_data.sh: -------------------------------------------------------------------------------- 1 | # Copyright 2017-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | 7 | mkdir data 8 | pushd data 9 | wget https://dl.fbaipublicfiles.com/loop/vctk_data.zip 10 | unzip vctk_data.zip 11 | rm vctk_data.zip 12 | -------------------------------------------------------------------------------- /scripts/download_models.sh: -------------------------------------------------------------------------------- 1 | # Copyright 2017-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | 7 | mkdir -p models 8 | pushd models 9 | 10 | wget https://dl.fbaipublicfiles.com/loop/vctk_model.zip 11 | unzip vctk_model.zip 12 | rm vctk_model.zip 13 | 14 | wget https://dl.fbaipublicfiles.com/loop/vctk_alt_model.zip 15 | unzip vctk_alt_model.zip 16 | rm vctk_alt_model.zip 17 | 18 | wget https://dl.fbaipublicfiles.com/loop/blizzard_model.zip 19 | unzip blizzard_model.zip 20 | rm blizzard_model.zip 21 | 22 | popd 23 | -------------------------------------------------------------------------------- /scripts/download_tools.sh: -------------------------------------------------------------------------------- 1 | # Copyright 2017-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 
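# Usage (run from the repository root): bash scripts/download_tools.sh
# Clones Merlin, compiles its bundled SPTK/WORLD tools via compile_tools.sh,
# moves the resulting binaries into ./tools, and removes the Merlin checkout.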
6 | 7 | 8 | echo "Downloading merlin" 9 | git clone https://github.com/CSTR-Edinburgh/merlin 10 | 11 | pushd merlin/tools 12 | ./compile_tools.sh 13 | popd 14 | 15 | mv merlin/tools/bin tools 16 | rm -rf merlin 17 | -------------------------------------------------------------------------------- /scripts/requirements.txt: -------------------------------------------------------------------------------- 1 | visdom 2 | numpy 3 | tqdm 4 | scipy 5 | -------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | 7 | import os 8 | import argparse 9 | import visdom 10 | import numpy as np 11 | from tqdm import tqdm 12 | 13 | import torch 14 | import torch.optim as optim 15 | 16 | from data import NpzFolder, NpzLoader, TBPTTIter 17 | from model import Loop, MaskedMSE 18 | from utils import create_output_dir, wrap, check_grad 19 | 20 | 21 | parser = argparse.ArgumentParser(description='PyTorch Loop') 22 | # Env options: 23 | parser.add_argument('--epochs', type=int, default=92, metavar='N', 24 | help='number of epochs to train (default: 92)') 25 | parser.add_argument('--seed', type=int, default=1, metavar='S', 26 | help='random seed (default: 1)') 27 | parser.add_argument('--expName', type=str, default='vctk', metavar='E', 28 | help='Experiment name') 29 | parser.add_argument('--data', default='data/vctk', 30 | metavar='D', type=str, help='Data path') 31 | parser.add_argument('--checkpoint', default='', 32 | metavar='C', type=str, help='Checkpoint path') 33 | parser.add_argument('--gpu', default=0, 34 | metavar='G', type=int, help='GPU device ID') 35 | parser.add_argument('--visualize', action='store_true', 36 | help='Visualize train and validation loss.') 37 | # Data options 38 | parser.add_argument('--seq-len', type=int, default=100, 39 | help='Sequence length for tbptt') 40 | parser.add_argument('--max-seq-len', type=int, default=1000, 41 | help='Max sequence length for tbptt') 42 | parser.add_argument('--batch-size', type=int, default=64, 43 | help='Batch size') 44 | parser.add_argument('--lr', type=float, default=1e-4, 45 | help='Learning rate') 46 | parser.add_argument('--clip-grad', type=float, default=0.5, 47 | help='maximum norm of gradient clipping') 48 | parser.add_argument('--ignore-grad', type=float, default=10000.0, 49 | help='ignore grad before clipping') 50 | # Model options 51 | parser.add_argument('--vocabulary-size', type=int, default=44, 52 | help='Vocabulary size') 53 | parser.add_argument('--output-size', type=int, default=63, 54 | help='Size of decoder output vector') 55 | parser.add_argument('--hidden-size', type=int, default=256, 56 | help='Hidden layer size') 57 | parser.add_argument('--K', type=int, default=10, 58 | help='No. 
--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
1 | # Copyright 2017-present, Facebook, Inc.
2 | # All rights reserved.
3 | #
4 | # This source code is licensed under the license found in the
5 | # LICENSE file in the root directory of this source tree.
6 | 
7 | import os
8 | import argparse
9 | import visdom
10 | import numpy as np
11 | from tqdm import tqdm
12 | 
13 | import torch
14 | import torch.optim as optim
15 | 
16 | from data import NpzFolder, NpzLoader, TBPTTIter
17 | from model import Loop, MaskedMSE
18 | from utils import create_output_dir, wrap, check_grad
19 | 
20 | 
21 | parser = argparse.ArgumentParser(description='PyTorch Loop')
22 | # Env options:
23 | parser.add_argument('--epochs', type=int, default=92, metavar='N',
24 |                     help='number of epochs to train (default: 92)')
25 | parser.add_argument('--seed', type=int, default=1, metavar='S',
26 |                     help='random seed (default: 1)')
27 | parser.add_argument('--expName', type=str, default='vctk', metavar='E',
28 |                     help='Experiment name')
29 | parser.add_argument('--data', default='data/vctk',
30 |                     metavar='D', type=str, help='Data path')
31 | parser.add_argument('--checkpoint', default='',
32 |                     metavar='C', type=str, help='Checkpoint path')
33 | parser.add_argument('--gpu', default=0,
34 |                     metavar='G', type=int, help='GPU device ID')
35 | parser.add_argument('--visualize', action='store_true',
36 |                     help='Visualize train and validation loss.')
37 | # Data options
38 | parser.add_argument('--seq-len', type=int, default=100,
39 |                     help='Sequence length for tbptt')
40 | parser.add_argument('--max-seq-len', type=int, default=1000,
41 |                     help='Max sequence length for tbptt')
42 | parser.add_argument('--batch-size', type=int, default=64,
43 |                     help='Batch size')
44 | parser.add_argument('--lr', type=float, default=1e-4,
45 |                     help='Learning rate')
46 | parser.add_argument('--clip-grad', type=float, default=0.5,
47 |                     help='maximum norm of gradient clipping')
48 | parser.add_argument('--ignore-grad', type=float, default=10000.0,
49 |                     help='ignore grad before clipping')
50 | # Model options
51 | parser.add_argument('--vocabulary-size', type=int, default=44,
52 |                     help='Vocabulary size')
53 | parser.add_argument('--output-size', type=int, default=63,
54 |                     help='Size of decoder output vector')
55 | parser.add_argument('--hidden-size', type=int, default=256,
56 |                     help='Hidden layer size')
57 | parser.add_argument('--K', type=int, default=10,
58 |                     help='No. of attention gaussians')
59 | parser.add_argument('--noise', type=int, default=4,
60 |                     help='Noise level to use')
61 | parser.add_argument('--attention-alignment', type=float, default=0.05,
62 |                     help='# of features per letter/phoneme')
63 | parser.add_argument('--nspk', type=int, default=22,
64 |                     help='Number of speakers')
65 | parser.add_argument('--mem-size', type=int, default=20,
66 |                     help='Memory number of segments')
67 | 
68 | 
69 | # init
70 | args = parser.parse_args()
71 | args.expName = os.path.join('checkpoints', args.expName)
72 | torch.cuda.set_device(args.gpu)
73 | torch.manual_seed(args.seed)
74 | torch.cuda.manual_seed(args.seed)
75 | logging = create_output_dir(args)
76 | vis = visdom.Visdom(env=args.expName)
77 | 
78 | 
79 | # data
80 | logging.info("Building dataset.")
81 | train_dataset = NpzFolder(args.data + '/numpy_features', args.nspk == 1)
82 | train_loader = NpzLoader(train_dataset,
83 |                          max_seq_len=args.max_seq_len,
84 |                          batch_size=args.batch_size,
85 |                          num_workers=4,
86 |                          pin_memory=True,
87 |                          shuffle=True)
88 | 
89 | valid_dataset = NpzFolder(args.data + '/numpy_features_valid', args.nspk == 1)
90 | valid_loader = NpzLoader(valid_dataset,
91 |                          max_seq_len=args.max_seq_len,
92 |                          batch_size=args.batch_size,
93 |                          num_workers=4,
94 |                          pin_memory=True)
95 | 
96 | logging.info("Dataset ready!")
97 | 
98 | 
99 | def train(model, criterion, optimizer, epoch, train_losses):
100 |     total = 0  # Reset every plot_every
101 |     model.train()
102 |     train_enum = tqdm(train_loader, desc='Train epoch %d' % epoch)
103 | 
104 |     for full_txt, full_feat, spkr in train_enum:
105 |         batch_iter = TBPTTIter(full_txt, full_feat, spkr, args.seq_len)
106 |         batch_total = 0
107 | 
108 |         for txt, feat, spkr, start in batch_iter:
109 |             input = wrap(txt)
110 |             target = wrap(feat)
111 |             spkr = wrap(spkr)
112 | 
113 |             # Zero gradients
114 |             if start:
115 |                 optimizer.zero_grad()
116 | 
117 |             # Forward
118 |             output, _ = model([input, spkr], target[0], start)
119 |             loss = criterion(output, target[0], target[1])
120 | 
121 |             # Backward
122 |             loss.backward()
123 |             if check_grad(model.parameters(), args.clip_grad, args.ignore_grad):
124 |                 logging.info('Not a finite gradient or too big, ignoring.')
125 |                 optimizer.zero_grad()
126 |                 continue
127 |             optimizer.step()
128 | 
129 |             # Keep track of loss
130 |             batch_total += loss.data[0]
131 | 
132 |         batch_total = batch_total / len(batch_iter)
133 |         total += batch_total
134 |         train_enum.set_description('Train (loss %.2f) epoch %d' %
135 |                                    (batch_total, epoch))
136 | 
137 |     avg = total / len(train_loader)
138 |     train_losses.append(avg)
139 |     if args.visualize:
140 |         vis.line(Y=np.asarray(train_losses),
141 |                  X=torch.arange(1, 1 + len(train_losses)),
142 |                  opts=dict(title="Train"),
143 |                  win='Train loss ' + args.expName)
144 | 
145 |     logging.info('====> Train set loss: {:.4f}'.format(avg))
146 | 
147 | 
148 | def evaluate(model, criterion, epoch, eval_losses):
149 |     total = 0
150 |     valid_enum = tqdm(valid_loader, desc='Valid epoch %d' % epoch)
151 | 
152 |     for txt, feat, spkr in valid_enum:
153 |         input = wrap(txt, volatile=True)
154 |         target = wrap(feat, volatile=True)
155 |         spkr = wrap(spkr, volatile=True)
156 | 
157 |         output, _ = model([input, spkr], target[0])
158 |         loss = criterion(output, target[0], target[1])
159 | 
160 |         total += loss.data[0]
161 | 
162 |         valid_enum.set_description('Valid (loss %.2f) epoch %d' %
163 |                                    (loss.data[0], epoch))
164 | 
165 |     avg = total / len(valid_loader)
166 |     eval_losses.append(avg)
167 |     if args.visualize:
168 |         vis.line(Y=np.asarray(eval_losses),
169 |                  X=torch.arange(1, 1 + len(eval_losses)),
170 |                  opts=dict(title="Eval"),
171 |                  win='Eval loss ' + args.expName)
172 | 
173 |     logging.info('====> Valid set loss: {:.4f}'.format(avg))
174 |     return avg
175 | 
176 | 
177 | def main():
178 |     start_epoch = 1
179 |     model = Loop(args)
180 |     model.cuda()
181 | 
182 |     if args.checkpoint != '':
183 |         checkpoint_args_path = os.path.dirname(args.checkpoint) + '/args.pth'
184 |         checkpoint_args = torch.load(checkpoint_args_path)
185 | 
186 |         start_epoch = checkpoint_args[3]
187 |         model.load_state_dict(torch.load(args.checkpoint))
188 | 
189 |     criterion = MaskedMSE().cuda()
190 |     optimizer = optim.Adam(model.parameters(), lr=args.lr)
191 | 
192 |     # Keep track of losses
193 |     train_losses = []
194 |     eval_losses = []
195 |     best_eval = float('inf')
196 | 
197 |     # Begin!
198 |     for epoch in range(start_epoch, start_epoch + args.epochs):
199 |         train(model, criterion, optimizer, epoch, train_losses)
200 |         eval_loss = evaluate(model, criterion, epoch, eval_losses)
201 |         if eval_loss < best_eval:
202 |             torch.save(model.state_dict(), '%s/bestmodel.pth' % (args.expName))
203 |             best_eval = eval_loss
204 | 
205 |         torch.save(model.state_dict(), '%s/lastmodel.pth' % (args.expName))
206 |         torch.save([args, train_losses, eval_losses, epoch],
207 |                    '%s/args.pth' % (args.expName))
208 | 
209 | 
210 | if __name__ == '__main__':
211 |     main()
212 | 
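
A typical training run, assuming the VCTK features were fetched with scripts/download_data.sh (the flags are those defined in the argparse block above; the values here are illustrative):

    python train.py --expName vctk --data data/vctk --gpu 0

Training can be resumed from a saved model via --checkpoint, e.g. --checkpoint checkpoints/vctk/lastmodel.pth; the args.pth stored alongside it is then used to restore the epoch counter.
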
--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
1 | # Copyright 2017-present, Facebook, Inc.
2 | # All rights reserved.
3 | #
4 | # This source code is licensed under the license found in the
5 | # LICENSE file in the root directory of this source tree.
6 | 
7 | from __future__ import print_function
8 | import os
9 | import logging
10 | import numpy
11 | import subprocess
12 | import time
13 | from datetime import timedelta
14 | 
15 | 
16 | import torch
17 | from torch.autograd import Variable
18 | 
19 | 
20 | class LogFormatter():
21 |     def __init__(self):
22 |         self.start_time = time.time()
23 | 
24 |     def format(self, record):
25 |         elapsed_seconds = round(record.created - self.start_time)
26 | 
27 |         prefix = "%s - %s - %s" % (
28 |             record.levelname,
29 |             time.strftime('%x %X'),
30 |             timedelta(seconds=elapsed_seconds)
31 |         )
32 |         message = record.getMessage()
33 |         message = message.replace('\n', '\n' + ' ' * (len(prefix) + 3))
34 |         return "%s - %s" % (prefix, message)
35 | 
36 | 
37 | def create_output_dir(opt):
38 |     filepath = os.path.join(opt.expName, 'main.log')
39 | 
40 |     if not os.path.exists(opt.expName):
41 |         os.makedirs(opt.expName)
42 | 
43 |     # Safety check
44 |     if os.path.exists(filepath) and opt.checkpoint == "":
45 |         logging.warning("Experiment already exists!")
46 | 
47 |     # Create logger
48 |     log_formatter = LogFormatter()
49 | 
50 |     # create file handler and set level to debug
51 |     file_handler = logging.FileHandler(filepath, "a")
52 |     file_handler.setLevel(logging.DEBUG)
53 |     file_handler.setFormatter(log_formatter)
54 | 
55 |     # create console handler and set level to info
56 |     console_handler = logging.StreamHandler()
57 |     console_handler.setLevel(logging.INFO)
58 |     console_handler.setFormatter(log_formatter)
59 | 
60 |     # create logger and set level to debug
61 |     logger = logging.getLogger()
62 |     logger.handlers = []
63 |     logger.setLevel(logging.DEBUG)
64 |     logger.propagate = False
65 |     logger.addHandler(file_handler)
66 |     logger.addHandler(console_handler)
67 | 
68 |     # quiet down visdom
69 |     logging.getLogger("requests").setLevel(logging.CRITICAL)
70 |     logging.getLogger("urllib3").setLevel(logging.CRITICAL)
71 | 
72 |     # reset logger elapsed time
73 |     def reset_time():
74 |         log_formatter.start_time = time.time()
75 |     logger.reset_time = reset_time
76 | 
77 |     logger.info(opt)
78 |     return logger
79 | 
80 | 
81 | def wrap(data, **kwargs):
82 |     if torch.is_tensor(data):
83 |         var = Variable(data, **kwargs).cuda()
84 |         return var
85 |     else:
86 |         return tuple([wrap(x, **kwargs) for x in data])
87 | 
88 | 
89 | def check_grad(params, clip_th, ignore_th):
90 |     # Clip to clip_th; signal the caller to skip the step if the pre-clip
91 |     # norm is non-finite or above ignore_th.
92 |     grad_norm = torch.nn.utils.clip_grad_norm(params, clip_th)
93 |     return (not numpy.isfinite(grad_norm) or (grad_norm > ignore_th))
94 | 
95 | 
96 | # Code taken from kastnerkyle gist:
97 | # https://gist.github.com/kastnerkyle/cc0ac48d34860c5bb3f9112f4d9a0300
98 | 
99 | # Convenience function to reuse the defined env
100 | def pwrap(args, shell=False):
101 |     p = subprocess.Popen(args, shell=shell, stdout=subprocess.PIPE,
102 |                          stdin=subprocess.PIPE, stderr=subprocess.PIPE,
103 |                          universal_newlines=True)
104 |     return p
105 | 
106 | # Print output
107 | # http://stackoverflow.com/questions/4417546/constantly-print-subprocess-output-while-process-is-running
108 | def execute(cmd, shell=False):
109 |     popen = pwrap(cmd, shell=shell)
110 |     for stdout_line in iter(popen.stdout.readline, ""):
111 |         yield stdout_line
112 | 
113 |     popen.stdout.close()
114 |     return_code = popen.wait()
115 |     if return_code:
116 |         raise subprocess.CalledProcessError(return_code, cmd)
117 | 
118 | 
119 | def pe(cmd, shell=False):
120 |     """
121 |     Print and execute command on system
122 |     """
123 |     for line in execute(cmd, shell=shell):
124 |         print(line, end="")
125 | 
126 | 
127 | def array_to_binary_file(data, output_file_name):
128 |     data = numpy.array(data, 'float32')
129 |     fid = open(output_file_name, 'wb')
130 |     data.tofile(fid)
131 |     fid.close()
132 | 
133 | 
134 | def load_binary_file_frame(file_name, dimension):
135 |     fid_lab = open(file_name, 'rb')
136 |     features = numpy.fromfile(fid_lab, dtype=numpy.float32)
137 |     fid_lab.close()
138 |     assert features.size % float(dimension) == 0.0, 'specified dimension %s not compatible with data' % dimension
139 |     frame_number = features.size / dimension
140 |     features = features[:(dimension * frame_number)]
141 |     features = features.reshape((-1, dimension))
142 |     return features, frame_number
143 | 
144 | 
145 | def generate_merlin_wav(
146 |         data, gen_dir, file_basename, norm_info_file,
147 |         do_post_filtering=True, mgc_dim=60, fl=1024, sr=16000):
148 |     # Made from Jose's code and Merlin
149 |     gen_dir = os.path.abspath(gen_dir) + "/"
150 |     if file_basename is None:
151 |         base = "tmp_gen_wav"
152 |     else:
153 |         base = file_basename
154 |     if not os.path.exists(gen_dir):
155 |         os.mkdir(gen_dir)
156 | 
157 |     file_name = os.path.join(gen_dir, base + ".cmp")
158 |     fid = open(norm_info_file, 'rb')
159 |     cmp_info = numpy.fromfile(fid, dtype=numpy.float32)
160 |     fid.close()
161 |     cmp_info = cmp_info.reshape((2, -1))
162 |     cmp_mean = cmp_info[0, ]
163 |     cmp_std = cmp_info[1, ]
164 | 
165 |     data = data * cmp_std + cmp_mean
166 | 
167 |     array_to_binary_file(data, file_name)
168 |     # This code was adapted from Merlin. All licenses apply
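169 |     # Split streams, post-filter the spectrum (SPTK), then synthesize (WORLD).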
170 |     out_dimension_dict = {'bap': 1, 'lf0': 1, 'mgc': 60, 'vuv': 1}
171 |     stream_start_index = {}
172 |     file_extension_dict = {
173 |         'mgc': '.mgc', 'bap': '.bap', 'lf0': '.lf0',
174 |         'dur': '.dur', 'cmp': '.cmp'}
175 |     gen_wav_features = ['mgc', 'lf0', 'bap']
176 | 
177 |     dimension_index = 0
178 |     for feature_name in out_dimension_dict.keys():
179 |         stream_start_index[feature_name] = dimension_index
180 |         dimension_index += out_dimension_dict[feature_name]
181 | 
182 |     dir_name = os.path.dirname(file_name)
183 |     file_id = os.path.splitext(os.path.basename(file_name))[0]
184 |     features, frame_number = load_binary_file_frame(file_name, 63)
185 | 
186 |     for feature_name in gen_wav_features:
187 | 
188 |         current_features = features[
189 |             :, stream_start_index[feature_name]:
190 |             stream_start_index[feature_name] +
191 |             out_dimension_dict[feature_name]]
192 | 
193 |         gen_features = current_features
194 | 
195 |         if feature_name in ['lf0', 'F0']:
196 |             if 'vuv' in stream_start_index.keys():
197 |                 vuv_feature = features[
198 |                     :, stream_start_index['vuv']:stream_start_index['vuv'] + 1]
199 | 
200 |                 for i in range(frame_number):
201 |                     if vuv_feature[i, 0] < 0.5:
202 |                         gen_features[i, 0] = -1.0e+10  # self.inf_float
203 | 
204 |         new_file_name = os.path.join(
205 |             dir_name, file_id + file_extension_dict[feature_name])
206 | 
207 |         array_to_binary_file(gen_features, new_file_name)
208 | 
209 |     pf_coef = 1.4
210 |     fw_alpha = 0.58
211 |     co_coef = 511
212 | 
213 |     sptkdir = os.path.abspath(os.path.dirname(__file__) + "/tools/SPTK-3.9/") + '/'
214 |     sptk_path = {
215 |         'SOPR': sptkdir + 'sopr',
216 |         'FREQT': sptkdir + 'freqt',
217 |         'VSTAT': sptkdir + 'vstat',
218 |         'MGC2SP': sptkdir + 'mgc2sp',
219 |         'MERGE': sptkdir + 'merge',
220 |         'BCP': sptkdir + 'bcp',
221 |         'MC2B': sptkdir + 'mc2b',
222 |         'C2ACR': sptkdir + 'c2acr',
223 |         'MLPG': sptkdir + 'mlpg',
224 |         'VOPR': sptkdir + 'vopr',
225 |         'B2MC': sptkdir + 'b2mc',
226 |         'X2X': sptkdir + 'x2x',
227 |         'VSUM': sptkdir + 'vsum'}
228 | 
229 |     worlddir = os.path.abspath(os.path.dirname(__file__) + "/tools/WORLD/") + '/'
230 |     world_path = {
231 |         'ANALYSIS': worlddir + 'analysis',
232 |         'SYNTHESIS': worlddir + 'synth'}
233 | 
234 |     fw_coef = fw_alpha
235 |     fl_coef = fl
236 | 
237 |     files = {'sp': base + '.sp',
238 |              'mgc': base + '.mgc',
239 |              'f0': base + '.f0',
240 |              'lf0': base + '.lf0',
241 |              'ap': base + '.ap',
242 |              'bap': base + '.bap',
243 |              'wav': base + '.wav'}
244 | 
245 |     mgc_file_name = files['mgc']
246 |     cur_dir = os.getcwd()
247 |     os.chdir(gen_dir)
248 | 
249 |     # post-filtering
250 |     if do_post_filtering:
251 |         line = "echo 1 1 "
252 |         for i in range(2, mgc_dim):
253 |             line = line + str(pf_coef) + " "
254 | 
255 |         pe(
256 |             '{line} | {x2x} +af > {weight}'
257 |             .format(
258 |                 line=line, x2x=sptk_path['X2X'],
259 |                 weight=os.path.join(gen_dir, 'weight')), shell=True)
260 | 
261 |         pe(
262 |             '{freqt} -m {order} -a {fw} -M {co} -A 0 < {mgc} | '
263 |             '{c2acr} -m {co} -M 0 -l {fl} > {base_r0}'
264 |             .format(
265 |                 freqt=sptk_path['FREQT'], order=mgc_dim - 1,
266 |                 fw=fw_coef, co=co_coef, mgc=files['mgc'],
267 |                 c2acr=sptk_path['C2ACR'], fl=fl_coef,
268 |                 base_r0=files['mgc'] + '_r0'), shell=True)
269 | 
270 |         pe(
271 |             '{vopr} -m -n {order} < {mgc} {weight} | '
272 |             '{freqt} -m {order} -a {fw} -M {co} -A 0 | '
273 |             '{c2acr} -m {co} -M 0 -l {fl} > {base_p_r0}'
274 |             .format(
275 |                 vopr=sptk_path['VOPR'], order=mgc_dim - 1,
276 |                 mgc=files['mgc'],
277 |                 weight=os.path.join(gen_dir, 'weight'),
278 |                 freqt=sptk_path['FREQT'], fw=fw_coef, co=co_coef,
279 |                 c2acr=sptk_path['C2ACR'], fl=fl_coef,
280 |                 base_p_r0=files['mgc'] + '_p_r0'), shell=True)
281 | 
282 |         pe(
283 |             '{vopr} -m -n {order} < {mgc} {weight} | '
284 |             '{mc2b} -m {order} -a {fw} | '
285 |             '{bcp} -n {order} -s 0 -e 0 > {base_b0}'
286 |             .format(
287 |                 vopr=sptk_path['VOPR'], order=mgc_dim - 1,
288 |                 mgc=files['mgc'],
289 |                 weight=os.path.join(gen_dir, 'weight'),
290 |                 mc2b=sptk_path['MC2B'], fw=fw_coef,
291 |                 bcp=sptk_path['BCP'], base_b0=files['mgc'] + '_b0'), shell=True)
292 | 
293 |         pe(
294 |             '{vopr} -d < {base_r0} {base_p_r0} | '
295 |             '{sopr} -LN -d 2 | {vopr} -a {base_b0} > {base_p_b0}'
296 |             .format(
297 |                 vopr=sptk_path['VOPR'],
298 |                 base_r0=files['mgc'] + '_r0',
299 |                 base_p_r0=files['mgc'] + '_p_r0',
300 |                 sopr=sptk_path['SOPR'],
301 |                 base_b0=files['mgc'] + '_b0',
302 |                 base_p_b0=files['mgc'] + '_p_b0'), shell=True)
303 | 
304 |         pe(
305 |             '{vopr} -m -n {order} < {mgc} {weight} | '
306 |             '{mc2b} -m {order} -a {fw} | '
307 |             '{bcp} -n {order} -s 1 -e {order} | '
308 |             '{merge} -n {order2} -s 0 -N 0 {base_p_b0} | '
309 |             '{b2mc} -m {order} -a {fw} > {base_p_mgc}'
310 |             .format(
311 |                 vopr=sptk_path['VOPR'], order=mgc_dim - 1,
312 |                 mgc=files['mgc'],
313 |                 weight=os.path.join(gen_dir, 'weight'),
314 |                 mc2b=sptk_path['MC2B'], fw=fw_coef,
315 |                 bcp=sptk_path['BCP'],
316 |                 merge=sptk_path['MERGE'], order2=mgc_dim - 2,
317 |                 base_p_b0=files['mgc'] + '_p_b0',
318 |                 b2mc=sptk_path['B2MC'],
319 |                 base_p_mgc=files['mgc'] + '_p_mgc'), shell=True)
320 | 
321 |         mgc_file_name = files['mgc'] + '_p_mgc'
322 | 
323 |     # Vocoder WORLD
324 | 
325 |     pe(
326 |         '{sopr} -magic -1.0E+10 -EXP -MAGIC 0.0 {lf0} | '
327 |         '{x2x} +fd > {f0}'
328 |         .format(
329 |             sopr=sptk_path['SOPR'], lf0=files['lf0'],
330 |             x2x=sptk_path['X2X'], f0=files['f0']), shell=True)
331 | 
332 |     pe(
333 |         '{sopr} -c 0 {bap} | {x2x} +fd > {ap}'.format(
334 |             sopr=sptk_path['SOPR'], bap=files['bap'],
335 |             x2x=sptk_path['X2X'], ap=files['ap']), shell=True)
336 | 
337 |     pe(
338 |         '{mgc2sp} -a {alpha} -g 0 -m {order} -l {fl} -o 2 {mgc} | '
339 |         '{sopr} -d 32768.0 -P | {x2x} +fd > {sp}'.format(
340 |             mgc2sp=sptk_path['MGC2SP'], alpha=fw_alpha,
341 |             order=mgc_dim - 1, fl=fl, mgc=mgc_file_name,
342 |             sopr=sptk_path['SOPR'], x2x=sptk_path['X2X'], sp=files['sp']),
343 |         shell=True)
344 | 
345 |     pe(
346 |         '{synworld} {fl} {sr} {f0} {sp} {ap} {wav}'.format(
347 |             synworld=world_path['SYNTHESIS'], fl=fl, sr=sr,
348 |             f0=files['f0'], sp=files['sp'], ap=files['ap'],
349 |             wav=files['wav']),
350 |         shell=True)
351 | 
352 |     pe(
353 |         'rm -f {ap} {sp} {f0} {bap} {lf0} {mgc} {mgc}_b0 {mgc}_p_b0 '
354 |         '{mgc}_p_mgc {mgc}_p_r0 {mgc}_r0 {cmp} weight'.format(
355 |             ap=files['ap'], sp=files['sp'], f0=files['f0'],
356 |             bap=files['bap'], lf0=files['lf0'], mgc=files['mgc'],
357 |             cmp=base + '.cmp'),
358 |         shell=True)
359 |     os.chdir(cur_dir)
360 | 
--------------------------------------------------------------------------------
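
For reference, a minimal sketch of calling the vocoder path in utils.py directly, mirroring the usage in notebooks/generate.ipynb (the .npy file and the norm_info path are illustrative placeholders, not part of the repo):

    import numpy
    from utils import generate_merlin_wav

    # Normalized acoustic frames of shape (T, 63): 60 MGC coefficients
    # plus the lf0, vuv and bap streams listed in out_dimension_dict.
    feat = numpy.load('sample_features.npy')  # hypothetical feature dump

    generate_merlin_wav(feat, '/tmp/gen', file_basename='sample',
                        norm_info_file='norm_info.dat',  # 2 x 63 float32 mean/std
                        do_post_filtering=True)
    # The waveform is synthesized with WORLD and written to /tmp/gen/sample.wav.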