├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING
├── LICENSE
├── README.md
├── data.py
├── generate.py
├── img
│   ├── attn_10.png
│   ├── attn_14.png
│   └── method.png
├── model.py
├── notebooks
│   └── generate.ipynb
├── scripts
│   ├── download_data.sh
│   ├── download_models.sh
│   ├── download_tools.sh
│   └── requirements.txt
├── train.py
└── utils.py
/.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | *.pth 3 | *.tar.gz 4 | *.egg-info 5 | models/** 6 | data/** 7 | tools/** 8 | checkpoints/** 9 | notebooks/.ipynb_checkpoints/** 10 | .ipynb_checkpoints/** 11 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Code of Conduct 2 | 3 | Facebook has adopted a Code of Conduct that we expect project participants to adhere to. Please read the [full text](https://code.fb.com/codeofconduct/) so that you can understand what actions will and will not be tolerated. 4 | -------------------------------------------------------------------------------- /CONTRIBUTING: -------------------------------------------------------------------------------- 1 | # Contributing to loop 2 | We want to make contributing to this project as easy and transparent as 3 | possible. 4 | 5 | ## Pull Requests 6 | We actively welcome your pull requests. 7 | 8 | 1. Fork the repo and create your branch from `master`. 9 | 2. If you've added code that should be tested, add tests. 10 | 3. If you've changed APIs, update the documentation. 11 | 4. Ensure the test suite passes. 12 | 5. Make sure your code lints. 13 | 6. If you haven't already, complete the Contributor License Agreement ("CLA"). 14 | 15 | ## Contributor License Agreement ("CLA") 16 | In order to accept your pull request, we need you to submit a CLA. You only need 17 | to do this once to work on any of Facebook's open source projects. 18 | 19 | Complete your CLA here: https://code.facebook.com/cla 20 | 21 | ## Issues 22 | We use GitHub issues to track public bugs. Please ensure your description is 23 | clear and has sufficient instructions to be able to reproduce the issue. 24 | 25 | Facebook has a [bounty program](https://www.facebook.com/whitehat/) for the safe 26 | disclosure of security bugs. In those cases, please go through the process 27 | outlined on that page and do not file a public issue. 28 | 29 | ## Coding Style 30 | * 2 spaces for indentation rather than tabs 31 | * 80 character line length 32 | 33 | ## License 34 | By contributing to loop, you agree that your contributions will be licensed 35 | under the LICENSE file in the root directory of this source tree. 36 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Attribution-NonCommercial 4.0 International 2 | 3 | ======================================================================= 4 | 5 | Creative Commons Corporation ("Creative Commons") is not a law firm and 6 | does not provide legal services or legal advice. Distribution of 7 | Creative Commons public licenses does not create a lawyer-client or 8 | other relationship. Creative Commons makes its licenses and related 9 | information available on an "as-is" basis. Creative Commons gives no 10 | warranties regarding its licenses, any material licensed under their 11 | terms and conditions, or any related information.
Creative Commons 12 | disclaims all liability for damages resulting from their use to the 13 | fullest extent possible. 14 | 15 | Using Creative Commons Public Licenses 16 | 17 | Creative Commons public licenses provide a standard set of terms and 18 | conditions that creators and other rights holders may use to share 19 | original works of authorship and other material subject to copyright 20 | and certain other rights specified in the public license below. The 21 | following considerations are for informational purposes only, are not 22 | exhaustive, and do not form part of our licenses. 23 | 24 | Considerations for licensors: Our public licenses are 25 | intended for use by those authorized to give the public 26 | permission to use material in ways otherwise restricted by 27 | copyright and certain other rights. Our licenses are 28 | irrevocable. Licensors should read and understand the terms 29 | and conditions of the license they choose before applying it. 30 | Licensors should also secure all rights necessary before 31 | applying our licenses so that the public can reuse the 32 | material as expected. Licensors should clearly mark any 33 | material not subject to the license. This includes other CC- 34 | licensed material, or material used under an exception or 35 | limitation to copyright. More considerations for licensors: 36 | wiki.creativecommons.org/Considerations_for_licensors 37 | 38 | Considerations for the public: By using one of our public 39 | licenses, a licensor grants the public permission to use the 40 | licensed material under specified terms and conditions. If 41 | the licensor's permission is not necessary for any reason--for 42 | example, because of any applicable exception or limitation to 43 | copyright--then that use is not regulated by the license. Our 44 | licenses grant only permissions under copyright and certain 45 | other rights that a licensor has authority to grant. Use of 46 | the licensed material may still be restricted for other 47 | reasons, including because others have copyright or other 48 | rights in the material. A licensor may make special requests, 49 | such as asking that all changes be marked or described. 50 | Although not required by our licenses, you are encouraged to 51 | respect those requests where reasonable. More_considerations 52 | for the public: 53 | wiki.creativecommons.org/Considerations_for_licensees 54 | 55 | ======================================================================= 56 | 57 | Creative Commons Attribution-NonCommercial 4.0 International Public 58 | License 59 | 60 | By exercising the Licensed Rights (defined below), You accept and agree 61 | to be bound by the terms and conditions of this Creative Commons 62 | Attribution-NonCommercial 4.0 International Public License ("Public 63 | License"). To the extent this Public License may be interpreted as a 64 | contract, You are granted the Licensed Rights in consideration of Your 65 | acceptance of these terms and conditions, and the Licensor grants You 66 | such rights in consideration of benefits the Licensor receives from 67 | making the Licensed Material available under these terms and 68 | conditions. 69 | 70 | 71 | Section 1 -- Definitions. 72 | 73 | a. 
Adapted Material means material subject to Copyright and Similar 74 | Rights that is derived from or based upon the Licensed Material 75 | and in which the Licensed Material is translated, altered, 76 | arranged, transformed, or otherwise modified in a manner requiring 77 | permission under the Copyright and Similar Rights held by the 78 | Licensor. For purposes of this Public License, where the Licensed 79 | Material is a musical work, performance, or sound recording, 80 | Adapted Material is always produced where the Licensed Material is 81 | synched in timed relation with a moving image. 82 | 83 | b. Adapter's License means the license You apply to Your Copyright 84 | and Similar Rights in Your contributions to Adapted Material in 85 | accordance with the terms and conditions of this Public License. 86 | 87 | c. Copyright and Similar Rights means copyright and/or similar rights 88 | closely related to copyright including, without limitation, 89 | performance, broadcast, sound recording, and Sui Generis Database 90 | Rights, without regard to how the rights are labeled or 91 | categorized. For purposes of this Public License, the rights 92 | specified in Section 2(b)(1)-(2) are not Copyright and Similar 93 | Rights. 94 | d. Effective Technological Measures means those measures that, in the 95 | absence of proper authority, may not be circumvented under laws 96 | fulfilling obligations under Article 11 of the WIPO Copyright 97 | Treaty adopted on December 20, 1996, and/or similar international 98 | agreements. 99 | 100 | e. Exceptions and Limitations means fair use, fair dealing, and/or 101 | any other exception or limitation to Copyright and Similar Rights 102 | that applies to Your use of the Licensed Material. 103 | 104 | f. Licensed Material means the artistic or literary work, database, 105 | or other material to which the Licensor applied this Public 106 | License. 107 | 108 | g. Licensed Rights means the rights granted to You subject to the 109 | terms and conditions of this Public License, which are limited to 110 | all Copyright and Similar Rights that apply to Your use of the 111 | Licensed Material and that the Licensor has authority to license. 112 | 113 | h. Licensor means the individual(s) or entity(ies) granting rights 114 | under this Public License. 115 | 116 | i. NonCommercial means not primarily intended for or directed towards 117 | commercial advantage or monetary compensation. For purposes of 118 | this Public License, the exchange of the Licensed Material for 119 | other material subject to Copyright and Similar Rights by digital 120 | file-sharing or similar means is NonCommercial provided there is 121 | no payment of monetary compensation in connection with the 122 | exchange. 123 | 124 | j. Share means to provide material to the public by any means or 125 | process that requires permission under the Licensed Rights, such 126 | as reproduction, public display, public performance, distribution, 127 | dissemination, communication, or importation, and to make material 128 | available to the public including in ways that members of the 129 | public may access the material from a place and at a time 130 | individually chosen by them. 131 | 132 | k. Sui Generis Database Rights means rights other than copyright 133 | resulting from Directive 96/9/EC of the European Parliament and of 134 | the Council of 11 March 1996 on the legal protection of databases, 135 | as amended and/or succeeded, as well as other essentially 136 | equivalent rights anywhere in the world. 
137 | 138 | l. You means the individual or entity exercising the Licensed Rights 139 | under this Public License. Your has a corresponding meaning. 140 | 141 | 142 | Section 2 -- Scope. 143 | 144 | a. License grant. 145 | 146 | 1. Subject to the terms and conditions of this Public License, 147 | the Licensor hereby grants You a worldwide, royalty-free, 148 | non-sublicensable, non-exclusive, irrevocable license to 149 | exercise the Licensed Rights in the Licensed Material to: 150 | 151 | a. reproduce and Share the Licensed Material, in whole or 152 | in part, for NonCommercial purposes only; and 153 | 154 | b. produce, reproduce, and Share Adapted Material for 155 | NonCommercial purposes only. 156 | 157 | 2. Exceptions and Limitations. For the avoidance of doubt, where 158 | Exceptions and Limitations apply to Your use, this Public 159 | License does not apply, and You do not need to comply with 160 | its terms and conditions. 161 | 162 | 3. Term. The term of this Public License is specified in Section 163 | 6(a). 164 | 165 | 4. Media and formats; technical modifications allowed. The 166 | Licensor authorizes You to exercise the Licensed Rights in 167 | all media and formats whether now known or hereafter created, 168 | and to make technical modifications necessary to do so. The 169 | Licensor waives and/or agrees not to assert any right or 170 | authority to forbid You from making technical modifications 171 | necessary to exercise the Licensed Rights, including 172 | technical modifications necessary to circumvent Effective 173 | Technological Measures. For purposes of this Public License, 174 | simply making modifications authorized by this Section 2(a) 175 | (4) never produces Adapted Material. 176 | 177 | 5. Downstream recipients. 178 | 179 | a. Offer from the Licensor -- Licensed Material. Every 180 | recipient of the Licensed Material automatically 181 | receives an offer from the Licensor to exercise the 182 | Licensed Rights under the terms and conditions of this 183 | Public License. 184 | 185 | b. No downstream restrictions. You may not offer or impose 186 | any additional or different terms or conditions on, or 187 | apply any Effective Technological Measures to, the 188 | Licensed Material if doing so restricts exercise of the 189 | Licensed Rights by any recipient of the Licensed 190 | Material. 191 | 192 | 6. No endorsement. Nothing in this Public License constitutes or 193 | may be construed as permission to assert or imply that You 194 | are, or that Your use of the Licensed Material is, connected 195 | with, or sponsored, endorsed, or granted official status by, 196 | the Licensor or others designated to receive attribution as 197 | provided in Section 3(a)(1)(A)(i). 198 | 199 | b. Other rights. 200 | 201 | 1. Moral rights, such as the right of integrity, are not 202 | licensed under this Public License, nor are publicity, 203 | privacy, and/or other similar personality rights; however, to 204 | the extent possible, the Licensor waives and/or agrees not to 205 | assert any such rights held by the Licensor to the limited 206 | extent necessary to allow You to exercise the Licensed 207 | Rights, but not otherwise. 208 | 209 | 2. Patent and trademark rights are not licensed under this 210 | Public License. 211 | 212 | 3. 
To the extent possible, the Licensor waives any right to 213 | collect royalties from You for the exercise of the Licensed 214 | Rights, whether directly or through a collecting society 215 | under any voluntary or waivable statutory or compulsory 216 | licensing scheme. In all other cases the Licensor expressly 217 | reserves any right to collect such royalties, including when 218 | the Licensed Material is used other than for NonCommercial 219 | purposes. 220 | 221 | 222 | Section 3 -- License Conditions. 223 | 224 | Your exercise of the Licensed Rights is expressly made subject to the 225 | following conditions. 226 | 227 | a. Attribution. 228 | 229 | 1. If You Share the Licensed Material (including in modified 230 | form), You must: 231 | 232 | a. retain the following if it is supplied by the Licensor 233 | with the Licensed Material: 234 | 235 | i. identification of the creator(s) of the Licensed 236 | Material and any others designated to receive 237 | attribution, in any reasonable manner requested by 238 | the Licensor (including by pseudonym if 239 | designated); 240 | 241 | ii. a copyright notice; 242 | 243 | iii. a notice that refers to this Public License; 244 | 245 | iv. a notice that refers to the disclaimer of 246 | warranties; 247 | 248 | v. a URI or hyperlink to the Licensed Material to the 249 | extent reasonably practicable; 250 | 251 | b. indicate if You modified the Licensed Material and 252 | retain an indication of any previous modifications; and 253 | 254 | c. indicate the Licensed Material is licensed under this 255 | Public License, and include the text of, or the URI or 256 | hyperlink to, this Public License. 257 | 258 | 2. You may satisfy the conditions in Section 3(a)(1) in any 259 | reasonable manner based on the medium, means, and context in 260 | which You Share the Licensed Material. For example, it may be 261 | reasonable to satisfy the conditions by providing a URI or 262 | hyperlink to a resource that includes the required 263 | information. 264 | 265 | 3. If requested by the Licensor, You must remove any of the 266 | information required by Section 3(a)(1)(A) to the extent 267 | reasonably practicable. 268 | 269 | 4. If You Share Adapted Material You produce, the Adapter's 270 | License You apply must not prevent recipients of the Adapted 271 | Material from complying with this Public License. 272 | 273 | 274 | Section 4 -- Sui Generis Database Rights. 275 | 276 | Where the Licensed Rights include Sui Generis Database Rights that 277 | apply to Your use of the Licensed Material: 278 | 279 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right 280 | to extract, reuse, reproduce, and Share all or a substantial 281 | portion of the contents of the database for NonCommercial purposes 282 | only; 283 | 284 | b. if You include all or a substantial portion of the database 285 | contents in a database in which You have Sui Generis Database 286 | Rights, then the database in which You have Sui Generis Database 287 | Rights (but not its individual contents) is Adapted Material; and 288 | 289 | c. You must comply with the conditions in Section 3(a) if You Share 290 | all or a substantial portion of the contents of the database. 291 | 292 | For the avoidance of doubt, this Section 4 supplements and does not 293 | replace Your obligations under this Public License where the Licensed 294 | Rights include other Copyright and Similar Rights. 295 | 296 | 297 | Section 5 -- Disclaimer of Warranties and Limitation of Liability. 298 | 299 | a. 
UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 300 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 301 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 302 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 303 | IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, 304 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 305 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 306 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 307 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 308 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 309 | 310 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 311 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 312 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 313 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 314 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 315 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 316 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 317 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 318 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 319 | 320 | c. The disclaimer of warranties and limitation of liability provided 321 | above shall be interpreted in a manner that, to the extent 322 | possible, most closely approximates an absolute disclaimer and 323 | waiver of all liability. 324 | 325 | 326 | Section 6 -- Term and Termination. 327 | 328 | a. This Public License applies for the term of the Copyright and 329 | Similar Rights licensed here. However, if You fail to comply with 330 | this Public License, then Your rights under this Public License 331 | terminate automatically. 332 | 333 | b. Where Your right to use the Licensed Material has terminated under 334 | Section 6(a), it reinstates: 335 | 336 | 1. automatically as of the date the violation is cured, provided 337 | it is cured within 30 days of Your discovery of the 338 | violation; or 339 | 340 | 2. upon express reinstatement by the Licensor. 341 | 342 | For the avoidance of doubt, this Section 6(b) does not affect any 343 | right the Licensor may have to seek remedies for Your violations 344 | of this Public License. 345 | 346 | c. For the avoidance of doubt, the Licensor may also offer the 347 | Licensed Material under separate terms or conditions or stop 348 | distributing the Licensed Material at any time; however, doing so 349 | will not terminate this Public License. 350 | 351 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public 352 | License. 353 | 354 | 355 | Section 7 -- Other Terms and Conditions. 356 | 357 | a. The Licensor shall not be bound by any additional or different 358 | terms or conditions communicated by You unless expressly agreed. 359 | 360 | b. Any arrangements, understandings, or agreements regarding the 361 | Licensed Material not stated herein are separate from and 362 | independent of the terms and conditions of this Public License. 363 | 364 | 365 | Section 8 -- Interpretation. 366 | 367 | a. For the avoidance of doubt, this Public License does not, and 368 | shall not be interpreted to, reduce, limit, restrict, or impose 369 | conditions on any use of the Licensed Material that could lawfully 370 | be made without permission under this Public License. 371 | 372 | b. 
To the extent possible, if any provision of this Public License is 373 | deemed unenforceable, it shall be automatically reformed to the 374 | minimum extent necessary to make it enforceable. If the provision 375 | cannot be reformed, it shall be severed from this Public License 376 | without affecting the enforceability of the remaining terms and 377 | conditions. 378 | 379 | c. No term or condition of this Public License will be waived and no 380 | failure to comply consented to unless expressly agreed to by the 381 | Licensor. 382 | 383 | d. Nothing in this Public License constitutes or may be interpreted 384 | as a limitation upon, or waiver of, any privileges and immunities 385 | that apply to the Licensor or You, including from the legal 386 | processes of any jurisdiction or authority. 387 | 388 | ======================================================================= 389 | 390 | Creative Commons is not a party to its public 391 | licenses. Notwithstanding, Creative Commons may elect to apply one of 392 | its public licenses to material it publishes and in those instances 393 | will be considered the “Licensor.” The text of the Creative Commons 394 | public licenses is dedicated to the public domain under the CC0 Public 395 | Domain Dedication. Except for the limited purpose of indicating that 396 | material is shared under a Creative Commons public license or as 397 | otherwise permitted by the Creative Commons policies published at 398 | creativecommons.org/policies, Creative Commons does not authorize the 399 | use of the trademark "Creative Commons" or any other trademark or logo 400 | of Creative Commons without its prior written consent including, 401 | without limitation, in connection with any unauthorized modifications 402 | to any of its public licenses or any other arrangements, 403 | understandings, or agreements concerning use of licensed material. For 404 | the avoidance of doubt, this paragraph does not form part of the 405 | public licenses. 406 | 407 | Creative Commons may be contacted at creativecommons.org. 408 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # VoiceLoop 2 | PyTorch implementation of the method described in the paper [VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop](https://arxiv.org/abs/1707.06588). 3 | 4 |
![VoiceLoop method diagram](img/method.png)
5 | 6 | VoiceLoop is a neural text-to-speech (TTS) system that transforms text to speech in voices that are sampled 7 | in the wild. Some demo samples can be [found here](https://ytaigman.github.io/loop/site/). 8 | 9 | ## Quick Links 10 | - [Demo Samples](https://ytaigman.github.io/loop/site/) 11 | - [Quick Start](#quick-start) 12 | - [Setup](#setup) 13 | - [Training](#training) 14 | 15 | ## Quick Start 16 | Follow the instructions in [Setup](#setup) and then simply execute: 17 | ```bash 18 | python generate.py --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 13 --checkpoint models/vctk/bestmodel.pth 19 | ``` 20 | Results will be placed in ```models/vctk/results```. This generates two samples: 21 | * The [generated sample](https://ytaigman.github.io/loop/demos/vctk_tutorial/p318_212.gen_10.wav) is saved with a `.gen_<speaker id>.wav` suffix. 22 | * Its [ground-truth (test) sample](https://ytaigman.github.io/loop/demos/vctk_tutorial/p318_212.orig.wav) is also generated and saved with an `.orig.wav` suffix. 23 | 24 | You can also generate the same text with a different speaker, for example: 25 | ```bash 26 | python generate.py --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 18 --checkpoint models/vctk/bestmodel.pth 27 | ``` 28 | This will generate the following [sample](https://ytaigman.github.io/loop/demos/vctk_tutorial/p318_212.gen_14.wav). 29 | 30 | Here is the corresponding attention plot: 31 | 32 |
![Attention plot, speaker 10](img/attn_10.png) ![Attention plot, speaker 14](img/attn_14.png)
33 | 34 | Legend: the x-axis is output time (acoustic samples); the y-axis is the input (text/phonemes). The left figure is speaker 10, the right is speaker 14. 35 | 36 | Finally, free text is also supported: 37 | ```bash 38 | python generate.py --text "hello world" --spkr 1 --checkpoint models/vctk/bestmodel.pth 39 | ``` 40 | 41 | ## Setup 42 | Requirements: Linux/OSX, Python 2.7 and [PyTorch 0.1.12](http://pytorch.org/). Generation requires installing [phonemizer](https://github.com/bootphon/phonemizer); follow the setup instructions there. 43 | The current version of the code requires CUDA support for training. Generation can be done on the CPU. 44 | 45 | ```bash 46 | git clone https://github.com/facebookresearch/loop.git 47 | cd loop 48 | pip install -r scripts/requirements.txt 49 | ``` 50 | 51 | ### Data 52 | The data used to train the models in the paper can be downloaded via: 53 | ```bash 54 | bash scripts/download_data.sh 55 | ``` 56 | 57 | The script downloads and preprocesses a subset of [VCTK](http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html). This subset contains speakers with an American accent. 58 | 59 | The dataset was preprocessed using [Merlin](http://www.cstr.ed.ac.uk/projects/merlin/): from each audio clip we extracted vocoder features using the [WORLD](http://ml.cs.yamanashi.ac.jp/world/english/) vocoder. After downloading, the dataset will be located under the subfolder ```data``` as follows: 60 | 61 | ``` 62 | loop 63 | ├── data 64 | │   └── vctk 65 | │       ├── norm_info 66 | │       │   └── norm.dat 67 | │       ├── numpy_features 68 | │       │   ├── p294_001.npz 69 | │       │   ├── p294_002.npz 70 | │       │   └── ... 71 | │       └── numpy_features_valid 72 | ``` 73 | 74 | The preprocessing pipeline can be executed using the following script by Kyle Kastner: https://gist.github.com/kastnerkyle/cc0ac48d34860c5bb3f9112f4d9a0300. 75 | 76 | ### Pretrained Models 77 | Pretrained models can be downloaded via: 78 | ```bash 79 | bash scripts/download_models.sh 80 | ``` 81 | After downloading, the models will be located under the subfolder ```models``` as follows: 82 | 83 | ``` 84 | loop 85 | ├── data 86 | ├── models 87 | │   ├── blizzard 88 | │   ├── vctk 89 | │   │   ├── args.pth 90 | │   │   └── bestmodel.pth 91 | │   └── vctk_alt 92 | ``` 93 | 94 | **Update 10/25/2017:** Single-speaker model available in models/blizzard/ 95 | 96 | ### SPTK and WORLD 97 | Finally, speech generation requires [SPTK 3.9](http://sp-tk.sourceforge.net/) and the [WORLD](http://ml.cs.yamanashi.ac.jp/world/english/) vocoder, as in Merlin. To download the executables: 98 | ```bash 99 | bash scripts/download_tools.sh 100 | ``` 101 | This results in the following subdirectories: 102 | ``` 103 | loop 104 | ├── data 105 | ├── models 106 | ├── tools 107 | │   ├── SPTK-3.9 108 | │   └── WORLD 109 | ``` 110 | 111 | ## Training 112 | 113 | ### Single-Speaker 114 | The single-speaker model is trained on [Blizzard 2011](http://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/). Data should be downloaded and prepared as described above.
Once the data is ready, run: 115 | ```bash 116 | python train.py --noise 1 --expName blizzard_init --seq-len 1600 --max-seq-len 1600 --data data/blizzard --nspk 1 --lr 1e-5 --epochs 10 117 | ``` 118 | Then, continue training the model with: 119 | ```bash 120 | python train.py --noise 1 --expName blizzard --seq-len 1600 --max-seq-len 1600 --data data/blizzard --nspk 1 --lr 1e-4 --checkpoint checkpoints/blizzard_init/bestmodel.pth --epochs 90 121 | ``` 122 | ### Multi-Speaker 123 | To train a new model on VCTK, first train the model using a noise level of 4 and an input sequence length of 100: 124 | ```bash 125 | python train.py --expName vctk --data data/vctk --noise 4 --seq-len 100 --epochs 90 126 | ``` 127 | Then, continue training the model using a noise level of 2, on full sequences: 128 | ```bash 129 | python train.py --expName vctk_noise_2 --data data/vctk --checkpoint checkpoints/vctk/bestmodel.pth --noise 2 --seq-len 1000 --epochs 90 130 | ``` 131 | 132 | ## Citation 133 | If you find this code useful in your research then please cite: 134 | 135 | ``` 136 | @article{taigman2017voice, 137 | title = {VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop}, 138 | author = {Taigman, Yaniv and Wolf, Lior and Polyak, Adam and Nachmani, Eliya}, 139 | journal = {ArXiv e-prints}, 140 | archivePrefix = "arXiv", 141 | eprinttype = {arxiv}, 142 | eprint = {1707.06588}, 143 | primaryClass = "cs.CL", 144 | year = {2017}, 145 | month = {October}, 146 | } 147 | ``` 148 | 149 | ## License 150 | Loop has a CC BY-NC 4.0 license. 151 | -------------------------------------------------------------------------------- /data.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree.
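# --- Editor's note: a minimal usage sketch of this module, mirroring the
# training loop in train.py. NpzFolder indexes *.npz utterances, NpzLoader
# batches them through collate_by_input_length, and TBPTTIter slices each
# padded target batch into seq_len segments for truncated backpropagation
# through time. The paths below are assumptions taken from the README:
#
#     from data import NpzFolder, NpzLoader, TBPTTIter
#     dataset = NpzFolder('data/vctk/numpy_features')
#     loader = NpzLoader(dataset, max_seq_len=1000, batch_size=64)
#     for txt, feat, spkr in loader:              # one padded batch
#         for src, tgt, spkrs, start in TBPTTIter(txt, feat, spkr, 100):
#             pass                                # one TBPTT segment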
6 | 7 | from functools import partial 8 | from collections import defaultdict 9 | import numpy as np 10 | import os 11 | 12 | import torch 13 | import torch.utils.data as data 14 | 15 | 16 | # Taken from 17 | # https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/Dataset.py 18 | def batchify(data): 19 | out, lengths = None, None 20 | 21 | lengths = [x.size(0) for x in data] 22 | max_length = max(lengths) 23 | 24 | if data[0].dim() == 1: 25 | out = data[0].new(len(data), max_length).fill_(0) 26 | for i in range(len(data)): 27 | data_length = data[i].size(0) 28 | out[i].narrow(0, 0, data_length).copy_(data[i]) 29 | else: 30 | feat_size = data[0].size(1) 31 | out = data[0].new(len(data), max_length, feat_size).fill_(0) 32 | for i in range(len(data)): 33 | data_length = data[i].size(0) 34 | out[i].narrow(0, 0, data_length).copy_(data[i]) 35 | 36 | return out, lengths 37 | 38 | 39 | def collate_by_input_length(batch, max_seq_len): 40 | "Puts each data field into a tensor with outer dimension batch size" 41 | if torch.is_tensor(batch[0]): 42 | return batchify(batch) 43 | elif isinstance(batch[0], int): 44 | return torch.LongTensor(batch) 45 | else: 46 | new_batch = [x for x in batch if x[1].size(0) < max_seq_len] 47 | if len(new_batch) == 0: 48 | return (None, None), (None, None), None 49 | 50 | batch = new_batch 51 | transposed = zip(*batch) 52 | (srcBatch, srcLengths), (tgtBatch, tgtLengths), speakers = \ 53 | [collate_by_input_length(samples, max_seq_len) 54 | for samples in transposed] 55 | 56 | # within batch sorting by decreasing length for variable length rnns 57 | batch = zip(srcBatch, tgtBatch, tgtLengths, speakers) 58 | batch, srcLengths = zip(*sorted(zip(batch, srcLengths), 59 | key=lambda x: -x[1])) 60 | srcBatch, tgtBatch, tgtLengths, speakers = zip(*batch) 61 | 62 | srcBatch = torch.stack(srcBatch, 0).transpose(0, 1).contiguous() 63 | tgtBatch = torch.stack(tgtBatch, 0).transpose(0, 1).contiguous() 64 | srcLengths = torch.LongTensor(srcLengths) 65 | tgtLengths = torch.LongTensor(tgtLengths) 66 | speakers = torch.LongTensor(speakers).view(-1, 1) 67 | 68 | return (srcBatch, srcLengths), (tgtBatch, tgtLengths), speakers 69 | 70 | raise TypeError(("batch must contain tensors, numbers, dicts or \ 71 | lists; found {}".format(type(batch[0])))) 72 | 73 | 74 | class NpzFolder(data.Dataset): 75 | NPZ_EXTENSION = 'npz' 76 | 77 | def __init__(self, root, single_spkr=False): 78 | self.root = root 79 | self.npzs = self.make_dataset(self.root) 80 | 81 | if len(self.npzs) == 0: 82 | raise(RuntimeError("Found 0 npz in subfolders of: " + root + "\n" 83 | "Supported file extensions are: " + 84 | self.NPZ_EXTENSION)) 85 | 86 | if single_spkr: 87 | self.speakers = defaultdict(lambda: 0) 88 | else: 89 | self.speakers = [] 90 | for fname in self.npzs: 91 | self.speakers += [os.path.basename(fname).split('_')[0]] 92 | self.speakers = list(set(self.speakers)) 93 | self.speakers.sort() 94 | self.speakers = {v: i for i, v in enumerate(self.speakers)} 95 | 96 | code2phone = np.load(self.npzs[0])['code2phone'] 97 | self.dict = {v: k for k, v in enumerate(code2phone)} 98 | 99 | def __getitem__(self, index): 100 | path = self.npzs[index] 101 | txt, feat, spkr = self.loader(path) 102 | 103 | return txt, feat, self.speakers[spkr] 104 | 105 | def __len__(self): 106 | return len(self.npzs) 107 | 108 | def make_dataset(self, dir): 109 | files = [] 110 | 111 | for root, _, fnames in sorted(os.walk(dir)): 112 | for fname in fnames: 113 | if self.NPZ_EXTENSION in fname: 114 | path = os.path.join(root, fname) 115 |
files.append(path) 116 | 117 | return files 118 | 119 | def loader(self, path): 120 | feat = np.load(path) 121 | 122 | txt = feat['phonemes'].astype('int64') 123 | txt = torch.from_numpy(txt) 124 | 125 | audio = feat['audio_features'] 126 | audio = torch.from_numpy(audio) 127 | 128 | spkr = os.path.basename(path).split('_')[0] 129 | 130 | return txt, audio, spkr 131 | 132 | 133 | class NpzLoader(data.DataLoader): 134 | def __init__(self, *args, **kwargs): 135 | kwargs['collate_fn'] = partial(collate_by_input_length, 136 | max_seq_len=kwargs['max_seq_len']) 137 | del kwargs['max_seq_len'] 138 | 139 | data.DataLoader.__init__(self, *args, **kwargs) 140 | 141 | 142 | class TBPTTIter(object): 143 | """ 144 | Iterator for truncated backpropagation through time (TBPTT) training. 145 | The target sequence is segmented while the input sequence remains the same. 146 | """ 147 | def __init__(self, src, trgt, spkr, seq_len): 148 | self.seq_len = seq_len 149 | self.start = True 150 | 151 | self.speakers = spkr 152 | self.srcBatch = src[0] 153 | self.srcLengths = src[1] 154 | 155 | # split batch 156 | self.tgtBatch = list(torch.split(trgt[0], self.seq_len, 0)) 157 | self.tgtBatch.reverse() 158 | self.len = len(self.tgtBatch) 159 | 160 | # split length list 161 | batch_seq_len = len(self.tgtBatch) 162 | self.tgtLengths = [self.split_length(l, batch_seq_len) for l in trgt[1]] 163 | self.tgtLengths = torch.stack(self.tgtLengths) 164 | self.tgtLengths = list(torch.split(self.tgtLengths, 1, 1)) 165 | self.tgtLengths = [x.squeeze() for x in self.tgtLengths] 166 | self.tgtLengths.reverse() 167 | 168 | assert len(self.tgtLengths) == len(self.tgtBatch) 169 | 170 | def split_length(self, seq_size, batch_seq_len): 171 | seq = [self.seq_len] * (seq_size // self.seq_len) 172 | if seq_size % self.seq_len != 0: 173 | seq += [seq_size % self.seq_len] 174 | seq += [0] * (batch_seq_len - len(seq)) 175 | return torch.LongTensor(seq) 176 | 177 | def __next__(self): 178 | if len(self.tgtBatch) == 0: 179 | raise StopIteration() 180 | 181 | if self.len > len(self.tgtBatch): 182 | self.start = False 183 | 184 | return (self.srcBatch, self.srcLengths), \ 185 | (self.tgtBatch.pop(), self.tgtLengths.pop()), \ 186 | self.speakers, self.start 187 | 188 | next = __next__ 189 | 190 | def __iter__(self): 191 | return self 192 | 193 | def __len__(self): 194 | return self.len 195 | -------------------------------------------------------------------------------- /generate.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree.
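# Example invocations (taken from the README):
#   python generate.py --npz data/vctk/numpy_features_valid/p318_212.npz \
#       --spkr 13 --checkpoint models/vctk/bestmodel.pth
#   python generate.py --text "hello world" --spkr 1 \
#       --checkpoint models/vctk/bestmodel.pth
# Generated .wav files are written to a 'results' folder next to the
# checkpoint (see main() below).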
6 | 7 | import os 8 | import argparse 9 | import numpy as np 10 | import phonemizer 11 | import string 12 | 13 | import torch 14 | from torch.autograd import Variable 15 | 16 | from model import Loop 17 | from data import NpzFolder 18 | from utils import generate_merlin_wav 19 | 20 | 21 | parser = argparse.ArgumentParser(description='PyTorch Phonological Loop \ 22 | Generation') 23 | parser.add_argument('--npz', type=str, default='', 24 | help='Dataset sample to generate.') 25 | parser.add_argument('--text', default='', 26 | type=str, help='Free text to generate.') 27 | parser.add_argument('--spkr', default=0, 28 | type=int, help='Speaker id.') 29 | parser.add_argument('--checkpoint', default='checkpoints/vctk/lastmodel.pth', 30 | type=str, help='Model used for generation.') 31 | parser.add_argument('--gpu', default=-1, 32 | type=int, help='GPU device ID, use -1 for CPU.') 33 | 34 | 35 | # init 36 | args = parser.parse_args() 37 | if args.gpu >= 0: 38 | torch.cuda.set_device(args.gpu) 39 | 40 | 41 | def text2phone(text, char2code): 42 | separator = phonemizer.separator.Separator('', '', ' ') 43 | ph = phonemizer.phonemize(text, separator=separator) 44 | ph = ph.split(' ') 45 | ph.remove('') 46 | 47 | result = [char2code[p] for p in ph] 48 | return torch.LongTensor(result) 49 | 50 | 51 | def trim_pred(out, attn): 52 | tq = attn.abs().sum(1).data 53 | 54 | for stopi in range(1, tq.size(0)): 55 | col_sum = attn[:stopi, :].abs().sum(0).data.squeeze() 56 | if tq[stopi][0] < 0.5 and col_sum[-1] > 4: 57 | break 58 | 59 | out = out[:stopi, :] 60 | attn = attn[:stopi, :] 61 | 62 | return out, attn 63 | 64 | 65 | def npy_loader_phonemes(path): 66 | feat = np.load(path) 67 | 68 | txt = feat['phonemes'].astype('int64') 69 | txt = torch.from_numpy(txt) 70 | 71 | audio = feat['audio_features'] 72 | audio = torch.from_numpy(audio) 73 | 74 | return txt, audio 75 | 76 | 77 | def main(): 78 | weights = torch.load(args.checkpoint, 79 | map_location=lambda storage, loc: storage) 80 | opt = torch.load(os.path.dirname(args.checkpoint) + '/args.pth') 81 | train_args = opt[0] 82 | 83 | char2code = {'aa': 0, 'ae': 1, 'ah': 2, 'ao': 3, 'aw': 4, 'ax': 5, 'ay': 6, 84 | 'b': 7, 'ch': 8, 'd': 9, 'dh': 10, 'eh': 11, 'er': 12, 'ey': 13, 85 | 'f': 14, 'g': 15, 'hh': 16, 'i': 17, 'ih': 18, 'iy': 19, 'jh': 20, 86 | 'k': 21, 'l': 22, 'm': 23, 'n': 24, 'ng': 25, 'ow': 26, 'oy': 27, 87 | 'p': 28, 'pau': 29, 'r': 30, 's': 31, 'sh': 32, 'ssil': 33, 88 | 't': 34, 'th': 35, 'uh': 36, 'uw': 37, 'v': 38, 'w': 39, 'y': 40, 89 | 'z': 41} 90 | nspkr = train_args.nspk 91 | 92 | norm_path = None 93 | if os.path.exists(train_args.data + '/norm_info/norm.dat'): 94 | norm_path = train_args.data + '/norm_info/norm.dat' 95 | elif os.path.exists(os.path.dirname(args.checkpoint) + '/norm.dat'): 96 | norm_path = os.path.dirname(args.checkpoint) + '/norm.dat' 97 | else: 98 | print('ERROR: Failed to find norm file.') 99 | return 100 | train_args.noise = 0 101 | 102 | model = Loop(train_args) 103 | model.load_state_dict(weights) 104 | if args.gpu >= 0: 105 | model.cuda() 106 | model.eval() 107 | 108 | if args.spkr not in range(nspkr): 109 | print('ERROR: Unknown speaker id: %d.'
% args.spkr) 110 | return 111 | 112 | txt, feat, spkr, output_fname = None, None, None, None 113 | if args.npz != '': 114 | txt, feat = npy_loader_phonemes(args.npz) 115 | 116 | txt = Variable(txt.unsqueeze(1), volatile=True) 117 | feat = Variable(feat.unsqueeze(1), volatile=True) 118 | spkr = Variable(torch.LongTensor([args.spkr]), volatile=True) 119 | 120 | fname = os.path.basename(args.npz)[:-4] 121 | output_fname = fname + '.gen_' + str(args.spkr) 122 | elif args.text != '': 123 | txt = text2phone(args.text, char2code) 124 | feat = torch.FloatTensor(txt.size(0)*20, 63) 125 | spkr = torch.LongTensor([args.spkr]) 126 | 127 | txt = Variable(txt.unsqueeze(1), volatile=True) 128 | feat = Variable(feat.unsqueeze(1), volatile=True) 129 | spkr = Variable(spkr, volatile=True) 130 | 131 | # slugify input string to file name 132 | fname = args.text.replace(' ', '_') 133 | valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits) 134 | fname = ''.join(c for c in fname if c in valid_chars) 135 | 136 | output_fname = fname + '.gen_' + str(args.spkr) 137 | else: 138 | print('ERROR: Must supply npz file path or text as source.') 139 | return 140 | 141 | if args.gpu >= 0: 142 | txt = txt.cuda() 143 | feat = feat.cuda() 144 | spkr = spkr.cuda() 145 | 146 | 147 | out, attn = model([txt, spkr], feat) 148 | out, attn = trim_pred(out, attn) 149 | 150 | output_dir = os.path.join(os.path.dirname(args.checkpoint), 'results') 151 | if not os.path.exists(output_dir): 152 | os.makedirs(output_dir) 153 | 154 | generate_merlin_wav(out.data.cpu().numpy(), 155 | output_dir, 156 | output_fname, 157 | norm_path) 158 | 159 | if args.npz != '': 160 | output_orig_fname = os.path.basename(args.npz)[:-4] + '.orig' 161 | generate_merlin_wav(feat[:, 0, :].data.cpu().numpy(), 162 | output_dir, 163 | output_orig_fname, 164 | norm_path) 165 | 166 | 167 | if __name__ == '__main__': 168 | main() 169 | -------------------------------------------------------------------------------- /img/attn_10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookarchive/loop/112975599b1838a33f139a6c6df0fcd9953ee33f/img/attn_10.png -------------------------------------------------------------------------------- /img/attn_14.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookarchive/loop/112975599b1838a33f139a6c6df0fcd9953ee33f/img/attn_14.png -------------------------------------------------------------------------------- /img/method.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookarchive/loop/112975599b1838a33f139a6c6df0fcd9953ee33f/img/method.png -------------------------------------------------------------------------------- /model.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree.
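# Editor's note on the attention used below: GravesAttention implements a
# monotonic GMM attention in the style of Graves' handwriting-synthesis
# model. Per output step, N_a predicts (g, b, k) for each of K components,
# and forward() computes (j indexes input positions):
#   g_t   = softmax(g) + eps                # mixture weights
#   sig_t = exp(b) + eps                    # precision-like terms
#   mu_t  = mu_{t-1} + alignment * exp(k)   # monotonically advancing means
#   alpha_t(j) = COEF * sum_k g_t[k] * exp(-0.5 * sig_t[k] * (mu_t[k] - j)^2)
# with COEF = 1/sqrt(2*pi); the context is then c_t = alpha_t @ context.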
6 | 7 | import torch 8 | import torch.nn as nn 9 | from torch.autograd import Variable 10 | from torch.nn.utils.rnn import pad_packed_sequence as unpack 11 | from torch.nn.utils.rnn import pack_padded_sequence as pack 12 | 13 | 14 | def getLinear(dim_in, dim_out): 15 | return nn.Sequential(nn.Linear(dim_in, dim_in/10), 16 | nn.ReLU(), 17 | nn.Linear(dim_in/10, dim_out)) 18 | 19 | 20 | class MaskedMSE(nn.Module): 21 | def __init__(self): 22 | super(MaskedMSE, self).__init__() 23 | self.criterion = nn.MSELoss(size_average=False) 24 | 25 | # Taken from 26 | # https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation 27 | @staticmethod 28 | def _sequence_mask(sequence_length, max_len): 29 | batch_size = sequence_length.size(0) 30 | seq_range = torch.arange(0, max_len).long() 31 | seq_range_expand = seq_range.unsqueeze(0).expand(batch_size, max_len) 32 | seq_range_expand = Variable(seq_range_expand) 33 | if sequence_length.is_cuda: 34 | seq_range_expand = seq_range_expand.cuda() 35 | seq_length_expand = sequence_length.unsqueeze(1) \ 36 | .expand_as(seq_range_expand) 37 | return (seq_range_expand < seq_length_expand).t().float() 38 | 39 | def forward(self, input, target, lengths): 40 | max_len = input.size(0) 41 | mask = self._sequence_mask(lengths, max_len).unsqueeze(2) 42 | mask_ = mask.expand_as(input) 43 | self.loss = self.criterion(input*mask_, target*mask_) 44 | self.loss = self.loss / mask.sum() 45 | return self.loss 46 | 47 | 48 | class Encoder(nn.Module): 49 | def __init__(self, opt): 50 | super(Encoder, self).__init__() 51 | self.hidden_size = opt.hidden_size 52 | self.vocabulary_size = opt.vocabulary_size 53 | self.nspk = opt.nspk 54 | self.lut_p = nn.Embedding(self.vocabulary_size, 55 | self.hidden_size, 56 | max_norm=1.0) 57 | self.lut_s = nn.Embedding(self.nspk, 58 | self.hidden_size, 59 | max_norm=1.0) 60 | 61 | def forward(self, input, speakers): 62 | if isinstance(input, tuple): 63 | lengths = input[1].data.view(-1).tolist() 64 | outputs = pack(self.lut_p(input[0]), lengths) 65 | else: 66 | outputs = self.lut_p(input) 67 | if isinstance(input, tuple): 68 | outputs = unpack(outputs)[0] 69 | 70 | ident = self.lut_s(speakers) 71 | if ident.dim() == 3: 72 | ident = ident.squeeze(1) 73 | 74 | return outputs, ident 75 | 76 | 77 | class GravesAttention(nn.Module): 78 | COEF = 0.3989422917366028 # numpy.sqrt(1/(2*numpy.pi)) 79 | 80 | def __init__(self, batch_size, mem_elem, K, attention_alignment): 81 | super(GravesAttention, self).__init__() 82 | self.K = K 83 | self.attention_alignment = attention_alignment 84 | self.epsilon = 1e-5 85 | 86 | self.sm = nn.Softmax() 87 | self.N_a = getLinear(mem_elem, 3*K) 88 | self.J = Variable(torch.arange(0, 500) 89 | .expand_as(torch.Tensor(batch_size, 90 | self.K, 91 | 500)), 92 | requires_grad=False) 93 | 94 | def forward(self, C, context, mu_tm1): 95 | gbk_t = self.N_a(C.view(C.size(0), C.size(1) * C.size(2))) 96 | gbk_t = gbk_t.view(gbk_t.size(0), -1, self.K) 97 | 98 | # attention model parameters 99 | g_t = gbk_t[:, 0, :] 100 | b_t = gbk_t[:, 1, :] 101 | k_t = gbk_t[:, 2, :] 102 | 103 | # attention GMM parameters 104 | g_t = self.sm(g_t) + self.epsilon 105 | sig_t = torch.exp(b_t) + self.epsilon 106 | mu_t = mu_tm1 + self.attention_alignment * torch.exp(k_t) 107 | 108 | g_t = g_t.unsqueeze(2).expand(g_t.size(0), 109 | g_t.size(1), 110 | context.size(1)) 111 | sig_t = sig_t.unsqueeze(2).expand_as(g_t) 112 | mu_t_ = mu_t.unsqueeze(2).expand_as(g_t) 113 | j = self.J[:g_t.size(0), :, :context.size(1)] 114 | 115 | # attention 
weights 116 | phi_t = g_t * torch.exp(-0.5 * sig_t * (mu_t_ - j)**2) 117 | alpha_t = self.COEF * torch.sum(phi_t, 1) 118 | 119 | c_t = torch.bmm(alpha_t, context).transpose(0, 1).squeeze(0) 120 | return c_t, mu_t, alpha_t 121 | 122 | 123 | class Decoder(nn.Module): 124 | def __init__(self, opt): 125 | super(Decoder, self).__init__() 126 | self.K = opt.K 127 | self.hidden_size = opt.hidden_size 128 | self.output_size = opt.output_size 129 | 130 | self.mem_size = opt.mem_size 131 | self.mem_feat_size = opt.output_size + opt.hidden_size 132 | self.mem_elem = self.mem_size * self.mem_feat_size 133 | 134 | self.attn = GravesAttention(opt.batch_size, 135 | self.mem_elem, 136 | self.K, 137 | opt.attention_alignment) 138 | 139 | self.N_o = getLinear(self.mem_elem, self.hidden_size) 140 | self.output = nn.Linear(self.hidden_size, self.output_size) 141 | self.N_u = getLinear(self.mem_elem, self.mem_feat_size) 142 | 143 | self.F_u = nn.Linear(self.hidden_size, self.hidden_size) 144 | self.F_o = nn.Linear(self.hidden_size, self.hidden_size) 145 | 146 | def init_buffer(self, ident, start=True): 147 | mem_feat_size = self.hidden_size + self.output_size 148 | batch_size = ident.size(0) 149 | 150 | if start: 151 | self.mu_t = Variable(ident.data.new(batch_size, self.K).zero_()) 152 | self.S_t = Variable(ident.data.new(batch_size, 153 | mem_feat_size, 154 | self.mem_size).zero_()) 155 | 156 | # initialize with identity 157 | self.S_t[:, :self.hidden_size, :] = ident.unsqueeze(2) \ 158 | .expand(ident.size(0), 159 | ident.size(1), 160 | self.mem_size) 161 | else: 162 | self.mu_t = self.mu_t.detach() 163 | self.S_t = self.S_t.detach() 164 | 165 | def update_buffer(self, S_tm1, c_t, o_tm1, ident): 166 | # concat previous output & context 167 | idt = torch.tanh(self.F_u(ident)) 168 | o_tm1 = o_tm1.squeeze(0) 169 | z_t = torch.cat([c_t + idt, o_tm1/30], 1) 170 | z_t = z_t.unsqueeze(2) 171 | Sp = torch.cat([z_t, S_tm1[:, :, :-1]], 2) 172 | 173 | # update S 174 | u = self.N_u(Sp.view(Sp.size(0), -1)) 175 | u[:, :idt.size(1)] = u[:, :idt.size(1)] + idt 176 | u = u.unsqueeze(2) 177 | S = torch.cat([u, S_tm1[:, :, :-1]], 2) 178 | 179 | return S 180 | 181 | def forward(self, x, ident, context, start=True): 182 | out, attns = [], [] 183 | o_t = x[0] 184 | self.init_buffer(ident, start) 185 | 186 | for o_tm1 in torch.split(x, 1): 187 | if not self.training: 188 | o_tm1 = o_t.unsqueeze(0) 189 | 190 | # predict weighted context based on S 191 | c_t, mu_t, alpha_t = self.attn(self.S_t, 192 | context.transpose(0, 1), 193 | self.mu_t) 194 | 195 | # advance mu and update buffer 196 | self.S_t = self.update_buffer(self.S_t, c_t, o_tm1, ident) 197 | self.mu_t = mu_t 198 | 199 | # predict next time step based on buffer content 200 | ot_out = self.N_o(self.S_t.view(self.S_t.size(0), -1)) 201 | sp_out = self.F_o(ident) 202 | o_t = self.output(ot_out + sp_out) 203 | 204 | out += [o_t] 205 | attns += [alpha_t.squeeze()] 206 | 207 | out_seq = torch.stack(out) 208 | attns_seq = torch.stack(attns) 209 | 210 | return out_seq, attns_seq 211 | 212 | 213 | class Loop(nn.Module): 214 | def __init__(self, opt): 215 | super(Loop, self).__init__() 216 | self.encoder = Encoder(opt) 217 | self.decoder = Decoder(opt) 218 | self.noise = opt.noise 219 | self.output_size = opt.output_size 220 | 221 | def init_input(self, tgt, start): 222 | if start: 223 | self.x_tm1 = torch.zeros(1, tgt.size(1), tgt.size(2)).type_as(tgt.data) 224 | 225 | if tgt.size(0) > 1: 226 | inp = torch.cat([self.x_tm1, tgt[:-1].data]) 227 | else: 228 | inp = self.x_tm1 229 | 
230 | if self.noise > 0: 231 | noise = tgt.data.new(inp.size()).normal_(0, self.noise) 232 | inp += noise 233 | 234 | if not self.training: 235 | inp.zero_() 236 | 237 | self.x_tm1 = tgt[-1].data.unsqueeze(0) 238 | return Variable(inp) 239 | 240 | def cuda(self, device_id=None): 241 | nn.Module.cuda(self, device_id) 242 | self.decoder.attn.J = self.decoder.attn.J.cuda(device_id) 243 | 244 | def forward(self, src, tgt, start=True): 245 | x = self.init_input(tgt, start) 246 | 247 | context, ident = self.encoder(src[0], src[1]) 248 | out, attn = self.decoder(x, ident, context, start) 249 | 250 | return out, attn 251 | -------------------------------------------------------------------------------- /notebooks/generate.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import sys\n", 12 | "sys.path.append('..')\n", 13 | "import os\n", 14 | "import shutil\n", 15 | "import torch\n", 16 | "\n", 17 | "import matplotlib\n", 18 | "import matplotlib.cm as cm\n", 19 | "import matplotlib.pyplot as plt\n", 20 | "\n", 21 | "import numpy\n", 22 | "%matplotlib notebook\n", 23 | "\n", 24 | "from data import *\n", 25 | "from model import Loop\n", 26 | "from utils import generate_merlin_wav\n", 27 | "\n", 28 | "from torch.autograd import Variable\n", 29 | "from IPython.display import Audio\n", 30 | "\n", 31 | "def plot(data, labels, dict_file):\n", 32 | " labels_dict = dict_file\n", 33 | " labels_dict = {v: k for k, v in labels_dict.iteritems()}\n", 34 | " labels = [labels_dict[x].decode('latin-1') for x in labels]\n", 35 | "\n", 36 | " axarr = plt.subplot()\n", 37 | " axarr.imshow(data.T, aspect='auto', origin='lower', interpolation='nearest', cmap=cm.viridis)\n", 38 | " axarr.set_yticks(numpy.arange(0, len(data.T)))\n", 39 | " axarr.set_yticklabels(labels, rotation=90)" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": { 46 | "collapsed": true 47 | }, 48 | "outputs": [], 49 | "source": [ 50 | "data_path = os.path.abspath('../data/vctk/numpy_features')\n", 51 | "norm_info = os.path.abspath('../data/vctk/norm_info/norm.dat')\n", 52 | " \n", 53 | "train_dataset = NpzFolder(data_path)\n", 54 | "valid_dataset = NpzFolder(data_path + '_valid')" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "torch.cuda.set_device(1)\n", 64 | "\n", 65 | "checkpoint = '../models/vctk'\n", 66 | "weights = torch.load(checkpoint + '/bestmodel.pth')\n", 67 | "\n", 68 | "args = torch.load(checkpoint + '/args.pth')\n", 69 | "opt = args[0]\n", 70 | "opt.noise = 0\n", 71 | "\n", 72 | "model = Loop(opt)\n", 73 | "model.load_state_dict(weights)\n", 74 | "model.cuda();\n", 75 | "model.eval();" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": { 82 | "scrolled": false 83 | }, 84 | "outputs": [], 85 | "source": [ 86 | "ID = 5\n", 87 | "txt, feat, _ = valid_dataset[8]\n", 88 | "\n", 89 | "txt = Variable(txt.unsqueeze(1), volatile=True).cuda()\n", 90 | "feat = Variable(feat.unsqueeze(1), volatile=True).cuda()\n", 91 | "spkr = Variable(torch.LongTensor([ID]), volatile=True).cuda()\n", 92 | "\n", 93 | "out, attn = model([txt, spkr], feat)\n", 94 | "\n", 95 | "generate_merlin_wav(out.data.cpu().numpy(), \"/tmp/gen\", file_basename='test',\n", 96 | " 
norm_info_file=norm_info, do_post_filtering=True)\n", 97 | "Audio('/tmp/gen/test.wav')" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "plot(attn.squeeze().data.cpu().numpy(), txt[:,0].squeeze().data.tolist(), valid_dataset.dict)" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "generate_merlin_wav(feat.data.cpu().numpy(), \"/tmp/gen\", file_basename='test',\n", 116 | " norm_info_file=norm_info, do_post_filtering=True)\n", 117 | "Audio('/tmp/gen/test.wav')" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "metadata": { 124 | "collapsed": true 125 | }, 126 | "outputs": [], 127 | "source": [] 128 | } 129 | ], 130 | "metadata": { 131 | "kernelspec": { 132 | "display_name": "Python 2", 133 | "language": "python", 134 | "name": "python2" 135 | }, 136 | "language_info": { 137 | "codemirror_mode": { 138 | "name": "ipython", 139 | "version": 2 140 | }, 141 | "file_extension": ".py", 142 | "mimetype": "text/x-python", 143 | "name": "python", 144 | "nbconvert_exporter": "python", 145 | "pygments_lexer": "ipython2", 146 | "version": "2.7.13" 147 | } 148 | }, 149 | "nbformat": 4, 150 | "nbformat_minor": 2 151 | } 152 | -------------------------------------------------------------------------------- /scripts/download_data.sh: -------------------------------------------------------------------------------- 1 | # Copyright 2017-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | 7 | mkdir data 8 | pushd data 9 | wget https://dl.fbaipublicfiles.com/loop/vctk_data.zip 10 | unzip vctk_data.zip 11 | rm vctk_data.zip 12 | -------------------------------------------------------------------------------- /scripts/download_models.sh: -------------------------------------------------------------------------------- 1 | # Copyright 2017-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | 7 | mkdir -p models 8 | pushd models 9 | 10 | wget https://dl.fbaipublicfiles.com/loop/vctk_model.zip 11 | unzip vctk_model.zip 12 | rm vctk_model.zip 13 | 14 | wget https://dl.fbaipublicfiles.com/loop/vctk_alt_model.zip 15 | unzip vctk_alt_model.zip 16 | rm vctk_alt_model.zip 17 | 18 | wget https://dl.fbaipublicfiles.com/loop/blizzard_model.zip 19 | unzip blizzard_model.zip 20 | rm blizzard_model.zip 21 | 22 | popd 23 | -------------------------------------------------------------------------------- /scripts/download_tools.sh: -------------------------------------------------------------------------------- 1 | # Copyright 2017-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 
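# Usage (run from the repository root): bash scripts/download_tools.sh
# Clones Merlin, compiles its bundled SPTK/WORLD tools via compile_tools.sh,
# moves the resulting binaries into ./tools, and removes the Merlin checkout.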
6 | 7 | 8 | echo "Downloading merlin" 9 | git clone https://github.com/CSTR-Edinburgh/merlin 10 | 11 | pushd merlin/tools 12 | ./compile_tools.sh 13 | popd 14 | 15 | mv merlin/tools/bin tools 16 | rm -rf merlin 17 | -------------------------------------------------------------------------------- /scripts/requirements.txt: -------------------------------------------------------------------------------- 1 | visdom 2 | numpy 3 | tqdm 4 | scipy 5 | -------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | 7 | import os 8 | import argparse 9 | import visdom 10 | import numpy as np 11 | from tqdm import tqdm 12 | 13 | import torch 14 | import torch.optim as optim 15 | 16 | from data import NpzFolder, NpzLoader, TBPTTIter 17 | from model import Loop, MaskedMSE 18 | from utils import create_output_dir, wrap, check_grad 19 | 20 | 21 | parser = argparse.ArgumentParser(description='PyTorch Loop') 22 | # Env options: 23 | parser.add_argument('--epochs', type=int, default=92, metavar='N', 24 | help='number of epochs to train (default: 92)') 25 | parser.add_argument('--seed', type=int, default=1, metavar='S', 26 | help='random seed (default: 1)') 27 | parser.add_argument('--expName', type=str, default='vctk', metavar='E', 28 | help='Experiment name') 29 | parser.add_argument('--data', default='data/vctk', 30 | metavar='D', type=str, help='Data path') 31 | parser.add_argument('--checkpoint', default='', 32 | metavar='C', type=str, help='Checkpoint path') 33 | parser.add_argument('--gpu', default=0, 34 | metavar='G', type=int, help='GPU device ID') 35 | parser.add_argument('--visualize', action='store_true', 36 | help='Visualize train and validation loss.') 37 | # Data options 38 | parser.add_argument('--seq-len', type=int, default=100, 39 | help='Sequence length for tbptt') 40 | parser.add_argument('--max-seq-len', type=int, default=1000, 41 | help='Max sequence length for tbptt') 42 | parser.add_argument('--batch-size', type=int, default=64, 43 | help='Batch size') 44 | parser.add_argument('--lr', type=float, default=1e-4, 45 | help='Learning rate') 46 | parser.add_argument('--clip-grad', type=float, default=0.5, 47 | help='maximum norm of gradient clipping') 48 | parser.add_argument('--ignore-grad', type=float, default=10000.0, 49 | help='ignore grad before clipping') 50 | # Model options 51 | parser.add_argument('--vocabulary-size', type=int, default=44, 52 | help='Vocabulary size') 53 | parser.add_argument('--output-size', type=int, default=63, 54 | help='Size of decoder output vector') 55 | parser.add_argument('--hidden-size', type=int, default=256, 56 | help='Hidden layer size') 57 | parser.add_argument('--K', type=int, default=10, 58 | help='No. 
--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
1 | # Copyright 2017-present, Facebook, Inc.
2 | # All rights reserved.
3 | #
4 | # This source code is licensed under the license found in the
5 | # LICENSE file in the root directory of this source tree.
6 | 
7 | import os
8 | import argparse
9 | import visdom
10 | import numpy as np
11 | from tqdm import tqdm
12 | 
13 | import torch
14 | import torch.optim as optim
15 | 
16 | from data import NpzFolder, NpzLoader, TBPTTIter
17 | from model import Loop, MaskedMSE
18 | from utils import create_output_dir, wrap, check_grad
19 | 
20 | 
21 | parser = argparse.ArgumentParser(description='PyTorch Loop')
22 | # Env options:
23 | parser.add_argument('--epochs', type=int, default=92, metavar='N',
24 |                     help='number of epochs to train (default: 92)')
25 | parser.add_argument('--seed', type=int, default=1, metavar='S',
26 |                     help='random seed (default: 1)')
27 | parser.add_argument('--expName', type=str, default='vctk', metavar='E',
28 |                     help='Experiment name')
29 | parser.add_argument('--data', default='data/vctk',
30 |                     metavar='D', type=str, help='Data path')
31 | parser.add_argument('--checkpoint', default='',
32 |                     metavar='C', type=str, help='Checkpoint path')
33 | parser.add_argument('--gpu', default=0,
34 |                     metavar='G', type=int, help='GPU device ID')
35 | parser.add_argument('--visualize', action='store_true',
36 |                     help='Visualize train and validation loss.')
37 | # Data options
38 | parser.add_argument('--seq-len', type=int, default=100,
39 |                     help='Sequence length for tbptt')
40 | parser.add_argument('--max-seq-len', type=int, default=1000,
41 |                     help='Max sequence length for tbptt')
42 | parser.add_argument('--batch-size', type=int, default=64,
43 |                     help='Batch size')
44 | parser.add_argument('--lr', type=float, default=1e-4,
45 |                     help='Learning rate')
46 | parser.add_argument('--clip-grad', type=float, default=0.5,
47 |                     help='maximum norm of gradient clipping')
48 | parser.add_argument('--ignore-grad', type=float, default=10000.0,
49 |                     help='ignore grad before clipping')
50 | # Model options
51 | parser.add_argument('--vocabulary-size', type=int, default=44,
52 |                     help='Vocabulary size')
53 | parser.add_argument('--output-size', type=int, default=63,
54 |                     help='Size of decoder output vector')
55 | parser.add_argument('--hidden-size', type=int, default=256,
56 |                     help='Hidden layer size')
57 | parser.add_argument('--K', type=int, default=10,
58 |                     help='No. of attention gaussians')
59 | parser.add_argument('--noise', type=int, default=4,
60 |                     help='Noise level to use')
61 | parser.add_argument('--attention-alignment', type=float, default=0.05,
62 |                     help='# of features per letter/phoneme')
63 | parser.add_argument('--nspk', type=int, default=22,
64 |                     help='Number of speakers')
65 | parser.add_argument('--mem-size', type=int, default=20,
66 |                     help='Memory number of segments')
67 | 
68 | 
69 | # init
70 | args = parser.parse_args()
71 | args.expName = os.path.join('checkpoints', args.expName)
72 | torch.cuda.set_device(args.gpu)
73 | torch.manual_seed(args.seed)
74 | torch.cuda.manual_seed(args.seed)
75 | logging = create_output_dir(args)
76 | vis = visdom.Visdom(env=args.expName)
77 | 
78 | 
79 | # data
80 | logging.info("Building dataset.")
81 | train_dataset = NpzFolder(args.data + '/numpy_features', args.nspk == 1)
82 | train_loader = NpzLoader(train_dataset,
83 |                          max_seq_len=args.max_seq_len,
84 |                          batch_size=args.batch_size,
85 |                          num_workers=4,
86 |                          pin_memory=True,
87 |                          shuffle=True)
88 | 
89 | valid_dataset = NpzFolder(args.data + '/numpy_features_valid', args.nspk == 1)
90 | valid_loader = NpzLoader(valid_dataset,
91 |                          max_seq_len=args.max_seq_len,
92 |                          batch_size=args.batch_size,
93 |                          num_workers=4,
94 |                          pin_memory=True)
95 | 
96 | logging.info("Dataset ready!")
97 | 
98 | 
99 | def train(model, criterion, optimizer, epoch, train_losses):
100 |     total = 0  # Reset every plot_every
101 |     model.train()
102 |     train_enum = tqdm(train_loader, desc='Train epoch %d' % epoch)
103 | 
104 |     for full_txt, full_feat, spkr in train_enum:
105 |         batch_iter = TBPTTIter(full_txt, full_feat, spkr, args.seq_len)
106 |         batch_total = 0
107 | 
108 |         for txt, feat, spkr, start in batch_iter:
109 |             input = wrap(txt)
110 |             target = wrap(feat)
111 |             spkr = wrap(spkr)
112 | 
113 |             # Zero gradients
114 |             if start:
115 |                 optimizer.zero_grad()
116 | 
117 |             # Forward
118 |             output, _ = model([input, spkr], target[0], start)
119 |             loss = criterion(output, target[0], target[1])
120 | 
121 |             # Backward
122 |             loss.backward()
123 |             if check_grad(model.parameters(), args.clip_grad, args.ignore_grad):
124 |                 logging.info('Not a finite gradient or too big, ignoring.')
125 |                 optimizer.zero_grad()
126 |                 continue
127 |             optimizer.step()
128 | 
129 |             # Keep track of loss
130 |             batch_total += loss.data[0]
131 | 
132 |         batch_total = batch_total / len(batch_iter)
133 |         total += batch_total
134 |         train_enum.set_description('Train (loss %.2f) epoch %d' %
135 |                                    (batch_total, epoch))
136 | 
137 |     avg = total / len(train_loader)
138 |     train_losses.append(avg)
139 |     if args.visualize:
140 |         vis.line(Y=np.asarray(train_losses),
141 |                  X=torch.arange(1, 1 + len(train_losses)),
142 |                  opts=dict(title="Train"),
143 |                  win='Train loss ' + args.expName)
144 | 
145 |     logging.info('====> Train set loss: {:.4f}'.format(avg))
146 | 
147 | 
148 | def evaluate(model, criterion, epoch, eval_losses):
149 |     total = 0
150 |     valid_enum = tqdm(valid_loader, desc='Valid epoch %d' % epoch)
151 | 
152 |     for txt, feat, spkr in valid_enum:
153 |         input = wrap(txt, volatile=True)
154 |         target = wrap(feat, volatile=True)
155 |         spkr = wrap(spkr, volatile=True)
156 | 
157 |         output, _ = model([input, spkr], target[0])
158 |         loss = criterion(output, target[0], target[1])
159 | 
160 |         total += loss.data[0]
161 | 
162 |         valid_enum.set_description('Valid (loss %.2f) epoch %d' %
163 |                                    (loss.data[0], epoch))
164 | 
165 |     avg = total / len(valid_loader)
166 |     eval_losses.append(avg)
167 |     if args.visualize:
168 |         vis.line(Y=np.asarray(eval_losses),
169 |                  X=torch.arange(1, 1 + len(eval_losses)),
170 |                  opts=dict(title="Eval"),
171 |                  win='Eval loss ' + args.expName)
172 | 
173 |     logging.info('====> Valid set loss: {:.4f}'.format(avg))
174 |     return avg
175 | 
176 | 
177 | def main():
178 |     start_epoch = 1
179 |     model = Loop(args)
180 |     model.cuda()
181 | 
182 |     if args.checkpoint != '':
183 |         checkpoint_args_path = os.path.dirname(args.checkpoint) + '/args.pth'
184 |         checkpoint_args = torch.load(checkpoint_args_path)
185 | 
186 |         start_epoch = checkpoint_args[3]
187 |         model.load_state_dict(torch.load(args.checkpoint))
188 | 
189 |     criterion = MaskedMSE().cuda()
190 |     optimizer = optim.Adam(model.parameters(), lr=args.lr)
191 | 
192 |     # Keep track of losses
193 |     train_losses = []
194 |     eval_losses = []
195 |     best_eval = float('inf')
196 | 
197 |     # Begin!
198 |     for epoch in range(start_epoch, start_epoch + args.epochs):
199 |         train(model, criterion, optimizer, epoch, train_losses)
200 |         eval_loss = evaluate(model, criterion, epoch, eval_losses)
201 |         if eval_loss < best_eval:
202 |             torch.save(model.state_dict(), '%s/bestmodel.pth' % (args.expName))
203 |             best_eval = eval_loss
204 | 
205 |         torch.save(model.state_dict(), '%s/lastmodel.pth' % (args.expName))
206 |         torch.save([args, train_losses, eval_losses, epoch],
207 |                    '%s/args.pth' % (args.expName))
208 | 
209 | 
210 | if __name__ == '__main__':
211 |     main()
212 | 
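
A typical training run, assuming the VCTK features were fetched with scripts/download_data.sh (the flags are those defined in the argparse block above; the values here are illustrative):

    python train.py --expName vctk --data data/vctk --gpu 0

Training can be resumed from a saved model via --checkpoint, e.g. --checkpoint checkpoints/vctk/lastmodel.pth; the args.pth stored alongside it is then used to restore the epoch counter.
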
--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
1 | # Copyright 2017-present, Facebook, Inc.
2 | # All rights reserved.
3 | #
4 | # This source code is licensed under the license found in the
5 | # LICENSE file in the root directory of this source tree.
6 | 
7 | from __future__ import print_function
8 | import os
9 | import logging
10 | import numpy
11 | import subprocess
12 | import time
13 | from datetime import timedelta
14 | 
15 | 
16 | import torch
17 | from torch.autograd import Variable
18 | 
19 | 
20 | class LogFormatter():
21 |     def __init__(self):
22 |         self.start_time = time.time()
23 | 
24 |     def format(self, record):
25 |         elapsed_seconds = round(record.created - self.start_time)
26 | 
27 |         prefix = "%s - %s - %s" % (
28 |             record.levelname,
29 |             time.strftime('%x %X'),
30 |             timedelta(seconds=elapsed_seconds)
31 |         )
32 |         message = record.getMessage()
33 |         message = message.replace('\n', '\n' + ' ' * (len(prefix) + 3))
34 |         return "%s - %s" % (prefix, message)
35 | 
36 | 
37 | def create_output_dir(opt):
38 |     filepath = os.path.join(opt.expName, 'main.log')
39 | 
40 |     if not os.path.exists(opt.expName):
41 |         os.makedirs(opt.expName)
42 | 
43 |     # Safety check
44 |     if os.path.exists(filepath) and opt.checkpoint == "":
45 |         logging.warning("Experiment already exists!")
46 | 
47 |     # Create logger
48 |     log_formatter = LogFormatter()
49 | 
50 |     # create file handler and set level to debug
51 |     file_handler = logging.FileHandler(filepath, "a")
52 |     file_handler.setLevel(logging.DEBUG)
53 |     file_handler.setFormatter(log_formatter)
54 | 
55 |     # create console handler and set level to info
56 |     console_handler = logging.StreamHandler()
57 |     console_handler.setLevel(logging.INFO)
58 |     console_handler.setFormatter(log_formatter)
59 | 
60 |     # create logger and set level to debug
61 |     logger = logging.getLogger()
62 |     logger.handlers = []
63 |     logger.setLevel(logging.DEBUG)
64 |     logger.propagate = False
65 |     logger.addHandler(file_handler)
66 |     logger.addHandler(console_handler)
67 | 
68 |     # quiet down visdom
69 |     logging.getLogger("requests").setLevel(logging.CRITICAL)
70 |     logging.getLogger("urllib3").setLevel(logging.CRITICAL)
71 | 
72 |     # reset logger elapsed time
73 |     def reset_time():
74 |         log_formatter.start_time = time.time()
75 |     logger.reset_time = reset_time
76 | 
77 |     logger.info(opt)
78 |     return logger
79 | 
80 | 
81 | def wrap(data, **kwargs):
82 |     if torch.is_tensor(data):
83 |         var = Variable(data, **kwargs).cuda()
84 |         return var
85 |     else:
86 |         return tuple([wrap(x, **kwargs) for x in data])
87 | 
88 | 
89 | def check_grad(params, clip_th, ignore_th):
90 |     # Clip to clip_th; signal the caller to skip the step if the pre-clip
91 |     # norm is non-finite or above ignore_th.
92 |     grad_norm = torch.nn.utils.clip_grad_norm(params, clip_th)
93 |     return (not numpy.isfinite(grad_norm) or (grad_norm > ignore_th))
94 | 
95 | 
96 | # Code taken from kastnerkyle gist:
97 | # https://gist.github.com/kastnerkyle/cc0ac48d34860c5bb3f9112f4d9a0300
98 | 
99 | # Convenience function to reuse the defined env
100 | def pwrap(args, shell=False):
101 |     p = subprocess.Popen(args, shell=shell, stdout=subprocess.PIPE,
102 |                          stdin=subprocess.PIPE, stderr=subprocess.PIPE,
103 |                          universal_newlines=True)
104 |     return p
105 | 
106 | # Print output
107 | # http://stackoverflow.com/questions/4417546/constantly-print-subprocess-output-while-process-is-running
108 | def execute(cmd, shell=False):
109 |     popen = pwrap(cmd, shell=shell)
110 |     for stdout_line in iter(popen.stdout.readline, ""):
111 |         yield stdout_line
112 | 
113 |     popen.stdout.close()
114 |     return_code = popen.wait()
115 |     if return_code:
116 |         raise subprocess.CalledProcessError(return_code, cmd)
117 | 
118 | 
119 | def pe(cmd, shell=False):
120 |     """
121 |     Print and execute command on system
122 |     """
123 |     for line in execute(cmd, shell=shell):
124 |         print(line, end="")
125 | 
126 | 
127 | def array_to_binary_file(data, output_file_name):
128 |     data = numpy.array(data, 'float32')
129 |     fid = open(output_file_name, 'wb')
130 |     data.tofile(fid)
131 |     fid.close()
132 | 
133 | 
134 | def load_binary_file_frame(file_name, dimension):
135 |     fid_lab = open(file_name, 'rb')
136 |     features = numpy.fromfile(fid_lab, dtype=numpy.float32)
137 |     fid_lab.close()
138 |     assert features.size % float(dimension) == 0.0, 'specified dimension %s not compatible with data' % dimension
139 |     frame_number = features.size / dimension
140 |     features = features[:(dimension * frame_number)]
141 |     features = features.reshape((-1, dimension))
142 |     return features, frame_number
143 | 
144 | 
145 | def generate_merlin_wav(
146 |         data, gen_dir, file_basename, norm_info_file,
147 |         do_post_filtering=True, mgc_dim=60, fl=1024, sr=16000):
148 |     # Made from Jose's code and Merlin
149 |     gen_dir = os.path.abspath(gen_dir) + "/"
150 |     if file_basename is None:
151 |         base = "tmp_gen_wav"
152 |     else:
153 |         base = file_basename
154 |     if not os.path.exists(gen_dir):
155 |         os.mkdir(gen_dir)
156 | 
157 |     file_name = os.path.join(gen_dir, base + ".cmp")
158 |     fid = open(norm_info_file, 'rb')
159 |     cmp_info = numpy.fromfile(fid, dtype=numpy.float32)
160 |     fid.close()
161 |     cmp_info = cmp_info.reshape((2, -1))
162 |     cmp_mean = cmp_info[0, ]
163 |     cmp_std = cmp_info[1, ]
164 | 
165 |     data = data * cmp_std + cmp_mean
166 | 
167 |     array_to_binary_file(data, file_name)
168 |     # This code was adapted from Merlin. All licenses apply
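169 |     # Split streams, post-filter the spectrum (SPTK), then synthesize (WORLD).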
170 |     out_dimension_dict = {'bap': 1, 'lf0': 1, 'mgc': 60, 'vuv': 1}
171 |     stream_start_index = {}
172 |     file_extension_dict = {
173 |         'mgc': '.mgc', 'bap': '.bap', 'lf0': '.lf0',
174 |         'dur': '.dur', 'cmp': '.cmp'}
175 |     gen_wav_features = ['mgc', 'lf0', 'bap']
176 | 
177 |     dimension_index = 0
178 |     for feature_name in out_dimension_dict.keys():
179 |         stream_start_index[feature_name] = dimension_index
180 |         dimension_index += out_dimension_dict[feature_name]
181 | 
182 |     dir_name = os.path.dirname(file_name)
183 |     file_id = os.path.splitext(os.path.basename(file_name))[0]
184 |     features, frame_number = load_binary_file_frame(file_name, 63)
185 | 
186 |     for feature_name in gen_wav_features:
187 | 
188 |         current_features = features[
189 |             :, stream_start_index[feature_name]:
190 |             stream_start_index[feature_name] +
191 |             out_dimension_dict[feature_name]]
192 | 
193 |         gen_features = current_features
194 | 
195 |         if feature_name in ['lf0', 'F0']:
196 |             if 'vuv' in stream_start_index.keys():
197 |                 vuv_feature = features[
198 |                     :, stream_start_index['vuv']:stream_start_index['vuv'] + 1]
199 | 
200 |                 for i in range(frame_number):
201 |                     if vuv_feature[i, 0] < 0.5:
202 |                         gen_features[i, 0] = -1.0e+10  # self.inf_float
203 | 
204 |         new_file_name = os.path.join(
205 |             dir_name, file_id + file_extension_dict[feature_name])
206 | 
207 |         array_to_binary_file(gen_features, new_file_name)
208 | 
209 |     pf_coef = 1.4
210 |     fw_alpha = 0.58
211 |     co_coef = 511
212 | 
213 |     sptkdir = os.path.abspath(os.path.dirname(__file__) + "/tools/SPTK-3.9/") + '/'
214 |     sptk_path = {
215 |         'SOPR': sptkdir + 'sopr',
216 |         'FREQT': sptkdir + 'freqt',
217 |         'VSTAT': sptkdir + 'vstat',
218 |         'MGC2SP': sptkdir + 'mgc2sp',
219 |         'MERGE': sptkdir + 'merge',
220 |         'BCP': sptkdir + 'bcp',
221 |         'MC2B': sptkdir + 'mc2b',
222 |         'C2ACR': sptkdir + 'c2acr',
223 |         'MLPG': sptkdir + 'mlpg',
224 |         'VOPR': sptkdir + 'vopr',
225 |         'B2MC': sptkdir + 'b2mc',
226 |         'X2X': sptkdir + 'x2x',
227 |         'VSUM': sptkdir + 'vsum'}
228 | 
229 |     worlddir = os.path.abspath(os.path.dirname(__file__) + "/tools/WORLD/") + '/'
230 |     world_path = {
231 |         'ANALYSIS': worlddir + 'analysis',
232 |         'SYNTHESIS': worlddir + 'synth'}
233 | 
234 |     fw_coef = fw_alpha
235 |     fl_coef = fl
236 | 
237 |     files = {'sp': base + '.sp',
238 |              'mgc': base + '.mgc',
239 |              'f0': base + '.f0',
240 |              'lf0': base + '.lf0',
241 |              'ap': base + '.ap',
242 |              'bap': base + '.bap',
243 |              'wav': base + '.wav'}
244 | 
245 |     mgc_file_name = files['mgc']
246 |     cur_dir = os.getcwd()
247 |     os.chdir(gen_dir)
248 | 
249 |     # post-filtering
250 |     if do_post_filtering:
251 |         line = "echo 1 1 "
252 |         for i in range(2, mgc_dim):
253 |             line = line + str(pf_coef) + " "
254 | 
255 |         pe(
256 |             '{line} | {x2x} +af > {weight}'
257 |             .format(
258 |                 line=line, x2x=sptk_path['X2X'],
259 |                 weight=os.path.join(gen_dir, 'weight')), shell=True)
260 | 
261 |         pe(
262 |             '{freqt} -m {order} -a {fw} -M {co} -A 0 < {mgc} | '
263 |             '{c2acr} -m {co} -M 0 -l {fl} > {base_r0}'
264 |             .format(
265 |                 freqt=sptk_path['FREQT'], order=mgc_dim - 1,
266 |                 fw=fw_coef, co=co_coef, mgc=files['mgc'],
267 |                 c2acr=sptk_path['C2ACR'], fl=fl_coef,
268 |                 base_r0=files['mgc'] + '_r0'), shell=True)
269 | 
270 |         pe(
271 |             '{vopr} -m -n {order} < {mgc} {weight} | '
272 |             '{freqt} -m {order} -a {fw} -M {co} -A 0 | '
273 |             '{c2acr} -m {co} -M 0 -l {fl} > {base_p_r0}'
274 |             .format(
275 |                 vopr=sptk_path['VOPR'], order=mgc_dim - 1,
276 |                 mgc=files['mgc'],
277 |                 weight=os.path.join(gen_dir, 'weight'),
278 |                 freqt=sptk_path['FREQT'], fw=fw_coef, co=co_coef,
279 |                 c2acr=sptk_path['C2ACR'], fl=fl_coef,
280 |                 base_p_r0=files['mgc'] + '_p_r0'), shell=True)
281 | 
282 |         pe(
283 |             '{vopr} -m -n {order} < {mgc} {weight} | '
284 |             '{mc2b} -m {order} -a {fw} | '
285 |             '{bcp} -n {order} -s 0 -e 0 > {base_b0}'
286 |             .format(
287 |                 vopr=sptk_path['VOPR'], order=mgc_dim - 1,
288 |                 mgc=files['mgc'],
289 |                 weight=os.path.join(gen_dir, 'weight'),
290 |                 mc2b=sptk_path['MC2B'], fw=fw_coef,
291 |                 bcp=sptk_path['BCP'], base_b0=files['mgc'] + '_b0'), shell=True)
292 | 
293 |         pe(
294 |             '{vopr} -d < {base_r0} {base_p_r0} | '
295 |             '{sopr} -LN -d 2 | {vopr} -a {base_b0} > {base_p_b0}'
296 |             .format(
297 |                 vopr=sptk_path['VOPR'],
298 |                 base_r0=files['mgc'] + '_r0',
299 |                 base_p_r0=files['mgc'] + '_p_r0',
300 |                 sopr=sptk_path['SOPR'],
301 |                 base_b0=files['mgc'] + '_b0',
302 |                 base_p_b0=files['mgc'] + '_p_b0'), shell=True)
303 | 
304 |         pe(
305 |             '{vopr} -m -n {order} < {mgc} {weight} | '
306 |             '{mc2b} -m {order} -a {fw} | '
307 |             '{bcp} -n {order} -s 1 -e {order} | '
308 |             '{merge} -n {order2} -s 0 -N 0 {base_p_b0} | '
309 |             '{b2mc} -m {order} -a {fw} > {base_p_mgc}'
310 |             .format(
311 |                 vopr=sptk_path['VOPR'], order=mgc_dim - 1,
312 |                 mgc=files['mgc'],
313 |                 weight=os.path.join(gen_dir, 'weight'),
314 |                 mc2b=sptk_path['MC2B'], fw=fw_coef,
315 |                 bcp=sptk_path['BCP'],
316 |                 merge=sptk_path['MERGE'], order2=mgc_dim - 2,
317 |                 base_p_b0=files['mgc'] + '_p_b0',
318 |                 b2mc=sptk_path['B2MC'],
319 |                 base_p_mgc=files['mgc'] + '_p_mgc'), shell=True)
320 | 
321 |         mgc_file_name = files['mgc'] + '_p_mgc'
322 | 
323 |     # Vocoder WORLD
324 | 
325 |     pe(
326 |         '{sopr} -magic -1.0E+10 -EXP -MAGIC 0.0 {lf0} | '
327 |         '{x2x} +fd > {f0}'
328 |         .format(
329 |             sopr=sptk_path['SOPR'], lf0=files['lf0'],
330 |             x2x=sptk_path['X2X'], f0=files['f0']), shell=True)
331 | 
332 |     pe(
333 |         '{sopr} -c 0 {bap} | {x2x} +fd > {ap}'.format(
334 |             sopr=sptk_path['SOPR'], bap=files['bap'],
335 |             x2x=sptk_path['X2X'], ap=files['ap']), shell=True)
336 | 
337 |     pe(
338 |         '{mgc2sp} -a {alpha} -g 0 -m {order} -l {fl} -o 2 {mgc} | '
339 |         '{sopr} -d 32768.0 -P | {x2x} +fd > {sp}'.format(
340 |             mgc2sp=sptk_path['MGC2SP'], alpha=fw_alpha,
341 |             order=mgc_dim - 1, fl=fl, mgc=mgc_file_name,
342 |             sopr=sptk_path['SOPR'], x2x=sptk_path['X2X'], sp=files['sp']),
343 |         shell=True)
344 | 
345 |     pe(
346 |         '{synworld} {fl} {sr} {f0} {sp} {ap} {wav}'.format(
347 |             synworld=world_path['SYNTHESIS'], fl=fl, sr=sr,
348 |             f0=files['f0'], sp=files['sp'], ap=files['ap'],
349 |             wav=files['wav']),
350 |         shell=True)
351 | 
352 |     pe(
353 |         'rm -f {ap} {sp} {f0} {bap} {lf0} {mgc} {mgc}_b0 {mgc}_p_b0 '
354 |         '{mgc}_p_mgc {mgc}_p_r0 {mgc}_r0 {cmp} weight'.format(
355 |             ap=files['ap'], sp=files['sp'], f0=files['f0'],
356 |             bap=files['bap'], lf0=files['lf0'], mgc=files['mgc'],
357 |             cmp=base + '.cmp'),
358 |         shell=True)
359 |     os.chdir(cur_dir)
360 | 
--------------------------------------------------------------------------------
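
For reference, a minimal sketch of calling the vocoder path in utils.py directly, mirroring the usage in notebooks/generate.ipynb (the .npy file and the norm_info path are illustrative placeholders, not part of the repo):

    import numpy
    from utils import generate_merlin_wav

    # Normalized acoustic frames of shape (T, 63): 60 MGC coefficients
    # plus the lf0, vuv and bap streams listed in out_dimension_dict.
    feat = numpy.load('sample_features.npy')  # hypothetical feature dump

    generate_merlin_wav(feat, '/tmp/gen', file_basename='sample',
                        norm_info_file='norm_info.dat',  # 2 x 63 float32 mean/std
                        do_post_filtering=True)
    # The waveform is synthesized with WORLD and written to /tmp/gen/sample.wav.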