├── .idea ├── .gitignore ├── VITS_voice_conversion.iml ├── inspectionProfiles │ ├── Project_Default.xml │ └── profiles_settings.xml ├── misc.xml ├── modules.xml └── vcs.xml ├── DATA.MD ├── DATA_EN.MD ├── LICENSE ├── LOCAL.md ├── README.md ├── README_ZH.md ├── VC_inference.py ├── attentions.py ├── cmd_inference.py ├── commons.py ├── configs ├── modified_finetune_speaker.json └── uma_trilingual.json ├── data_utils.py ├── finetune_speaker_v2.py ├── losses.py ├── mel_processing.py ├── models.py ├── models_infer.py ├── modules.py ├── monotonic_align ├── __init__.py ├── core.pyx └── setup.py ├── preprocess_v2.py ├── requirements.txt ├── scripts ├── denoise_audio.py ├── download_model.py ├── download_video.py ├── long_audio_transcribe.py ├── rearrange_speaker.py ├── resample.py ├── short_audio_transcribe.py ├── video2audio.py └── voice_upload.py ├── text ├── LICENSE ├── __init__.py ├── __pycache__ │ ├── __init__.cpython-37.pyc │ ├── cleaners.cpython-37.pyc │ ├── english.cpython-37.pyc │ ├── japanese.cpython-37.pyc │ ├── korean.cpython-37.pyc │ ├── mandarin.cpython-37.pyc │ ├── sanskrit.cpython-37.pyc │ ├── symbols.cpython-37.pyc │ └── thai.cpython-37.pyc ├── cantonese.py ├── cleaners.py ├── english.py ├── japanese.py ├── korean.py ├── mandarin.py ├── ngu_dialect.py ├── sanskrit.py ├── shanghainese.py ├── symbols.py └── thai.py ├── transforms.py └── utils.py /.idea/.gitignore: -------------------------------------------------------------------------------- 1 | # Default ignored files 2 | /shelf/ 3 | /workspace.xml 4 | -------------------------------------------------------------------------------- /.idea/VITS_voice_conversion.iml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 12 | -------------------------------------------------------------------------------- /.idea/inspectionProfiles/Project_Default.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 154 | -------------------------------------------------------------------------------- /.idea/inspectionProfiles/profiles_settings.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 6 | -------------------------------------------------------------------------------- /.idea/misc.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | -------------------------------------------------------------------------------- /.idea/modules.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | -------------------------------------------------------------------------------- /.idea/vcs.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | -------------------------------------------------------------------------------- /DATA.MD: -------------------------------------------------------------------------------- 1 | 本仓库的pipeline支持多种声音样本上传方式，您只需根据您所持有的样本选择任意一种或其中几种即可。 2 | 3 | 1.`.zip`文件打包的，按角色名排列的短音频，该压缩文件结构应如下所示： 4 | ``` 5 | Your-zip-file.zip 6 | ├───Character_name_1 7 | ├ ├───xxx.wav 8 | ├ ├───... 9 | ├ ├───yyy.mp3 10 | ├ └───zzz.wav 11 | ├───Character_name_2 12 | ├ ├───xxx.wav 13 | ├ ├───... 14 | ├ ├───yyy.mp3 15 | ├ └───zzz.wav 16 | ├───... 17 | ├ 18 | └───Character_name_n 19 | ├───xxx.wav 20 | ├───... 21 | ├───yyy.mp3 22 | └───zzz.wav 23 | ``` 24 | 注意音频的格式和名称都不重要，只要它们是音频文件。 25 | 质量要求：2秒以上，10秒以内，尽量不要有背景噪音。 26 | 数量要求：一个角色至少10条，最好每个角色20条以上。 27 | 2. 以角色名命名的长音频文件，音频内只能有单说话人，背景音会被自动去除。命名格式为：`{CharacterName}_{random_number}.wav` 28 | (例如：`Diana_234135.wav`, `MinatoAqua_234252.wav`)，必须是`.wav`文件，长度要在20分钟以内（否则会内存不足）。 29 | 30 | 3. 以角色名命名的长视频文件，视频内只能有单说话人，背景音会被自动去除。命名格式为：`{CharacterName}_{random_number}.mp4` 31 | (例如：`Taffy_332452.mp4`, `Dingzhen_957315.mp4`)，必须是`.mp4`文件，长度要在20分钟以内（否则会内存不足）。 32 | 注意：命名中，`CharacterName`必须是英文字符，`random_number`是为了区分同一个角色的多个文件，必须要添加，该数字可以为0~999999之间的任意整数。 33 | 34 | 4. 包含多行`{CharacterName}|{video_url}`的`.txt`文件，格式应如下所示： 35 | ``` 36 | Char1|https://xyz.com/video1/ 37 | Char2|https://xyz.com/video2/ 38 | Char2|https://xyz.com/video3/ 39 | Char3|https://xyz.com/video4/ 40 | ``` 41 | 视频内只能有单说话人，背景音会被自动去除。目前仅支持来自bilibili的视频，其它网站视频的url还没测试过。 42 | 若对格式有疑问，可以在[这里](https://drive.google.com/file/d/132l97zjanpoPY4daLgqXoM7HKXPRbS84/view?usp=sharing)找到所有格式对应的数据样本。 43 | -------------------------------------------------------------------------------- /DATA_EN.MD: -------------------------------------------------------------------------------- 1 | The pipeline of this repo supports multiple voice uploading options，you can choose one or more options depending on the data you have. 2 | 3 | 1. Short audios packed by a single `.zip` file, whose file structure should be as shown below: 4 | ``` 5 | Your-zip-file.zip 6 | ├───Character_name_1 7 | ├ ├───xxx.wav 8 | ├ ├───... 9 | ├ ├───yyy.mp3 10 | ├ └───zzz.wav 11 | ├───Character_name_2 12 | ├ ├───xxx.wav 13 | ├ ├───... 14 | ├ ├───yyy.mp3 15 | ├ └───zzz.wav 16 | ├───... 17 | ├ 18 | └───Character_name_n 19 | ├───xxx.wav 20 | ├───... 21 | ├───yyy.mp3 22 | └───zzz.wav 23 | ``` 24 | Note that the format of the audio files does not matter as long as they are audio files。 25 | Quality requirement: >=2s, <=10s, contain as little background sound as possible. 26 | Quantity requirement: at least 10 per character, 20+ per character is recommended. 27 | 28 | 2. Long audio files named by character names, which should contain single character voice only. Background sound is 29 | acceptable since they will be automatically removed. File name format `{CharacterName}_{random_number}.wav` 30 | (E.G. `Diana_234135.wav`, `MinatoAqua_234252.wav`), must be `.wav` files. 31 | 32 | 33 | 3. Long video files named by character names, which should contain single character voice only. Background sound is 34 | acceptable since they will be automatically removed. File name format `{CharacterName}_{random_number}.mp4` 35 | (E.G. `Taffy_332452.mp4`, `Dingzhen_957315.mp4`), must be `.mp4` files. 36 | Note: `CharacterName` must be English characters only, `random_number` is to identify multiple files for one character, 37 | which is compulsory to add. It could be a random integer between 0~999999. 38 | 39 | 4. A `.txt` containing multiple lines of`{CharacterName}|{video_url}`, which should be formatted as follows: 40 | ``` 41 | Char1|https://xyz.com/video1/ 42 | Char2|https://xyz.com/video2/ 43 | Char2|https://xyz.com/video3/ 44 | Char3|https://xyz.com/video4/ 45 | ``` 46 | One video should contain single speaker only. Currently supports videos links from bilibili, other websites are yet to be tested. 47 | Having questions regarding to data format? Fine data samples of all format from [here](https://drive.google.com/file/d/132l97zjanpoPY4daLgqXoM7HKXPRbS84/view?usp=sharing). 48 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /LOCAL.md: -------------------------------------------------------------------------------- 1 | # Train locally 2 | ### Build environment 3 | 0. Make sure you have installed `Python==3.8`, CMake & C/C++ compilers, ffmpeg; 4 | 1. Clone this repository; 5 | 2. Run `pip install -r requirements.txt`; 6 | 3. Install GPU version PyTorch: (Make sure you have CUDA 11.6 or 11.7 installed) 7 | ``` 8 | # CUDA 11.6 9 | pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116 10 | # CUDA 11.7 11 | pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117 12 | ``` 13 | 4. Install necessary libraries for dealing video data: 14 | ``` 15 | pip install imageio==2.4.1 16 | pip install moviepy 17 | ``` 18 | 5. Build monotonic align (necessary for training) 19 | ``` 20 | cd monotonic_align 21 | mkdir monotonic_align 22 | python setup.py build_ext --inplace 23 | cd .. 24 | ``` 25 | 6. Download auxiliary data for training 26 | ``` 27 | mkdir pretrained_models 28 | # download data for fine-tuning 29 | wget https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/sampled_audio4ft_v2.zip 30 | unzip sampled_audio4ft_v2.zip 31 | # create necessary directories 32 | mkdir video_data 33 | mkdir raw_audio 34 | mkdir denoised_audio 35 | mkdir custom_character_voice 36 | mkdir segmented_character_voice 37 | ``` 38 | 7. Download pretrained model, available options are: 39 | ``` 40 | CJE: Trilingual (Chinese, Japanese, English) 41 | CJ: Dualigual (Chinese, Japanese) 42 | C: Chinese only 43 | ``` 44 | ### Linux 45 | To download `CJE` model, run the following: 46 | ``` 47 | wget https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer/resolve/main/pretrained_models/D_trilingual.pth -O ./pretrained_models/D_0.pth 48 | wget https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer/resolve/main/pretrained_models/G_trilingual.pth -O ./pretrained_models/G_0.pth 49 | wget https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer/resolve/main/configs/uma_trilingual.json -O ./configs/finetune_speaker.json 50 | ``` 51 | To download `CJ` model, run the following: 52 | ``` 53 | wget https://huggingface.co/spaces/sayashi/vits-uma-genshin-honkai/resolve/main/model/D_0-p.pth -O ./pretrained_models/D_0.pth 54 | wget https://huggingface.co/spaces/sayashi/vits-uma-genshin-honkai/resolve/main/model/G_0-p.pth -O ./pretrained_models/G_0.pth 55 | wget https://huggingface.co/spaces/sayashi/vits-uma-genshin-honkai/resolve/main/model/config.json -O ./configs/finetune_speaker.json 56 | ``` 57 | To download `C` model, run the follwoing: 58 | ``` 59 | wget https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/VITS-Chinese/D_0.pth -O ./pretrained_models/D_0.pth 60 | wget https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/VITS-Chinese/G_0.pth -O ./pretrained_models/G_0.pth 61 | wget https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/VITS-Chinese/config.json -O ./configs/finetune_speaker.json 62 | ``` 63 | ### Windows 64 | Manually download `G_0.pth`, `D_0.pth`, `finetune_speaker.json` from the URLs in one of the options described above. 65 | 66 | Rename all `G` models to `G_0.pth`, `D` models to `D_0.pth`, config files (`.json`) to `finetune_speaker.json`. 67 | Put `G_0.pth`, `D_0.pth` under `pretrained_models` directory; 68 | Put `finetune_speaker.json` under `configs` directory 69 | 70 | #### Please note that when you download one of them, the previous model will be overwritten. 71 | 9. Put your voice data under corresponding directories, see [DATA.MD](https://github.com/Plachtaa/VITS-fast-fine-tuning/blob/main/DATA_EN.MD) for detailed different uploading options. 72 | ### Short audios 73 | 1. Prepare your data according to [DATA.MD](https://github.com/Plachtaa/VITS-fast-fine-tuning/blob/main/DATA_EN.MD) as a single `.zip` file; 74 | 2. Put your file under directory `./custom_character_voice/`; 75 | 3. run `unzip ./custom_character_voice/custom_character_voice.zip -d ./custom_character_voice/` 76 | 77 | ### Long audios 78 | 1. Name your audio files according to [DATA.MD](https://github.com/Plachtaa/VITS-fast-fine-tuning/blob/main/DATA_EN.MD); 79 | 2. Put your renamed audio files under directory `./raw_audio/` 80 | 81 | ### Videos 82 | 1. Name your video files according to [DATA.MD](https://github.com/Plachtaa/VITS-fast-fine-tuning/blob/main/DATA_EN.MD); 83 | 2. Put your renamed video files under directory `./video_data/` 84 | 10. Process all audio data. 85 | ``` 86 | python scripts/video2audio.py 87 | python scripts/denoise_audio.py 88 | python scripts/long_audio_transcribe.py --languages "{PRETRAINED_MODEL}" --whisper_size large 89 | python scripts/short_audio_transcribe.py --languages "{PRETRAINED_MODEL}" --whisper_size large 90 | python scripts/resample.py 91 | ``` 92 | Replace `"{PRETRAINED_MODEL}"` with one of `{CJ, CJE, C}` according to your previous model choice. 93 | Make sure you have a minimum GPU memory of 12GB. If not, change the argument `whisper_size` to `medium` or `small`. 94 | 95 | 10. Process all text data. 96 | If you choose to add auxiliary data, run `python preprocess_v2.py --add_auxiliary_data True --languages "{PRETRAINED_MODEL}"` 97 | If not, run `python preprocess_v2.py --languages "{PRETRAINED_MODEL}"` 98 | Do replace `"{PRETRAINED_MODEL}"` with one of `{CJ, CJE, C}` according to your previous model choice. 99 | 100 | 11. Start Training. 101 | Run `python finetune_speaker_v2.py -m ./OUTPUT_MODEL --max_epochs "{Maximum_epochs}" --drop_speaker_embed True` 102 | Do replace `{Maximum_epochs}` with your desired number of epochs. Empirically, 100 or more is recommended. 103 | To continue training on previous checkpoint, change the training command to: `python finetune_speaker_v2.py -m ./OUTPUT_MODEL --max_epochs "{Maximum_epochs}" --drop_speaker_embed False --cont True`. Before you do this, make sure you have previous `G_latest.pth` and `D_latest.pth` under `./OUTPUT_MODEL/` directory. 104 | To view training progress, open a new terminal and `cd` to the project root directory, run `tensorboard --logdir=./OUTPUT_MODEL`, then visit `localhost:6006` with your web browser. 105 | 106 | 12. After training is completed, you can use your model by running: 107 | `python VC_inference.py --model_dir ./OUTPUT_MODEL/G_latest.pth --share True` 108 | 13. To clear all audio data, run: 109 | ### Linux 110 | ``` 111 | rm -rf ./custom_character_voice/* ./video_data/* ./raw_audio/* ./denoised_audio/* ./segmented_character_voice/* ./separated/* long_character_anno.txt short_character_anno.txt 112 | ``` 113 | ### Windows 114 | ``` 115 | del /Q /S .\custom_character_voice\* .\video_data\* .\raw_audio\* .\denoised_audio\* .\segmented_character_voice\* .\separated\* long_character_anno.txt short_character_anno.txt 116 | ``` 117 | 118 | 119 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [中文文档请点击这里](https://github.com/Plachtaa/VITS-fast-fine-tuning/blob/main/README_ZH.md) 2 | # VITS Fast Fine-tuning 3 | This repo will guide you to add your own character voices, or even your own voice, into existing VITS TTS model 4 | to make it able to do the following tasks in less than 1 hour: 5 | 6 | 1. Many-to-many voice conversion between any characters you added & preset characters in the model. 7 | 2. English, Japanese & Chinese Text-to-Speech synthesis with the characters you added & preset characters 8 | 9 | 10 | Welcome to play around with the base models! 11 | Chinese & English & Japanese：[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer) Author: Me 12 | 13 | Chinese & Japanese：[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/sayashi/vits-uma-genshin-honkai) Author: [SayaSS](https://github.com/SayaSS) 14 | 15 | Chinese only：(No running huggingface spaces) Author: [Wwwwhy230825](https://github.com/Wwwwhy230825) 16 | 17 | 18 | ### Currently Supported Tasks: 19 | - [x] Clone character voice from 10+ short audios 20 | - [x] Clone character voice from long audio(s) >= 3 minutes (one audio should contain single speaker only) 21 | - [x] Clone character voice from videos(s) >= 3 minutes (one video should contain single speaker only) 22 | - [x] Clone character voice from BILIBILI video links (one video should contain single speaker only) 23 | 24 | ### Currently Supported Characters for TTS & VC: 25 | - [x] Any character you wish as long as you have their voices! 26 | (Note that voice conversion can only be conducted between any two speakers in the model) 27 | 28 | 29 | 30 | ## Fine-tuning 31 | See [LOCAL.md](https://github.com/Plachtaa/VITS-fast-fine-tuning/blob/main/LOCAL.md) for local training guide. 32 | Alternatively, you can perform fine-tuning on [Google Colab](https://colab.research.google.com/drive/1pn1xnFfdLK63gVXDwV4zCXfVeo8c-I-0?usp=sharing) 33 | 34 | 35 | ### How long does it take? 36 | 1. Install dependencies (3 min) 37 | 2. Choose pretrained model to start. The detailed differences between them are described in [Colab Notebook](https://colab.research.google.com/drive/1pn1xnFfdLK63gVXDwV4zCXfVeo8c-I-0?usp=sharing) 38 | 3. Upload the voice samples of the characters you wish to add，see [DATA.MD](https://github.com/Plachtaa/VITS-fast-fine-tuning/blob/main/DATA_EN.MD) for detailed uploading options. 39 | 4. Start fine-tuning. Time taken varies from 20 minutes ~ 2 hours, depending on the number of voices you uploaded. 40 | 41 | 42 | ## Inference or Usage (Currently support Windows only) 43 | 0. Remember to download your fine-tuned model! 44 | 1. Download the latest release 45 | 2. Put your model & config file into the folder `inference`, which are named `G_latest.pth` and `finetune_speaker.json`, respectively. 46 | 3. The file structure should be as follows: 47 | ``` 48 | inference 49 | ├───inference.exe 50 | ├───... 51 | ├───finetune_speaker.json 52 | └───G_latest.pth 53 | ``` 54 | 4. run `inference.exe`, the browser should pop up automatically. 55 | 5. Note: you must install `ffmpeg` to enable voice conversion feature. 56 | 57 | 58 | ## Inference with CLI 59 | In this example, we will show how to run inference with the default pretrained model. We are now in the main repository directory. 60 | 1. Create the necessary folders and download the necessary files. 61 | ``` 62 | cd monotonic_align/ 63 | mkdir monotonic_align 64 | python setup.py build_ext --inplace 65 | cd .. 66 | mkdir pretrained_models 67 | # download data for fine-tuning 68 | wget https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/sampled_audio4ft_v2.zip 69 | unzip sampled_audio4ft_v2.zip 70 | ``` 71 | 72 | For your finetuned model you may need to create additional directories: 73 | ``` 74 | mkdir video_data 75 | mkdir raw_audio 76 | mkdir denoised_audio 77 | mkdir custom_character_voice 78 | mkdir segmented_character_voice 79 | ``` 80 | 2. Download pretrained models. For example, trilingual model: 81 | ``` 82 | wget https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer/resolve/main/pretrained_models/D_trilingual.pth -O ./pretrained_models/D_0.pth 83 | wget https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer/resolve/main/pretrained_models/G_trilingual.pth -O ./pretrained_models/G_0.pth 84 | wget https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer/resolve/main/configs/uma_trilingual.json -O ./configs/finetune_speaker.json 85 | ``` 86 | 3. Activate your environment and run the following code: 87 | `python3 cmd_inference.py -m pretrained_models/G_0.pth -c configs/finetune_speaker.json -t 你好，训练员先生，很高兴见到你。 -s "派蒙 Paimon (Genshin Impact)" -l "简体中文"` 88 | You can choose another language, customize output folder, change text and character, but all these parameters you can see in the file `cmd_inference.py`. 89 | Below I'll show only how to change the character. 90 | 4. To change the character please open config file (`configs/finetune_speaker.json`). There you can find dictionary `speakers`, where you'll be able to see full list of speakers. Just copy the name of the character you need use it instead of `"派蒙 Paimon (Genshin Impact)"` 91 | 5. If you have success, you can find output `.wav` file in the `output/vits` 92 | 93 | 94 | ## Use in MoeGoe 95 | 0. Prepare downloaded model & config file, which are named `G_latest.pth` and `moegoe_config.json`, respectively. 96 | 1. Follow [MoeGoe](https://github.com/CjangCjengh/MoeGoe) page instructions to install, configure path, and use. 97 | 98 | ## Looking for help? 99 | If you have any questions, please feel free to open an [issue](https://github.com/Plachtaa/VITS-fast-fine-tuning/issues/new) or join our [Discord](https://discord.gg/TcrjDFvm5A) server. 100 | -------------------------------------------------------------------------------- /README_ZH.md: -------------------------------------------------------------------------------- 1 | English Documentation Please Click [here](https://github.com/Plachtaa/VITS-fast-fine-tuning/blob/main/README.md) 2 | # VITS 快速微调 3 | 这个代码库会指导你如何将自定义角色（甚至你自己），加入预训练的VITS模型中，在1小时内的微调使模型具备如下功能： 4 | 1. 在模型所包含的任意两个角色之间进行声线转换 5 | 2. 以你加入的角色声线进行中日英三语文本到语音合成。 6 | 7 | 本项目使用的底模涵盖常见二次元男/女配音声线（来自原神数据集）以及现实世界常见男/女声线（来自VCTK数据集），支持中日英三语，保证能够在微调时快速适应新的声线。 8 | 9 | 欢迎体验微调所使用的底模！ 10 | 11 | 中日英：[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer) 作者：我 12 | 13 | 中日：[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/sayashi/vits-uma-genshin-honkai) 作者：[SayaSS](https://github.com/SayaSS) 14 | 15 | 纯中文：（没有huggingface demo）作者：[Wwwwhy230825](https://github.com/Wwwwhy230825) 16 | 17 | ### 目前支持的任务: 18 | - [x] 从 10条以上的短音频克隆角色声音 19 | - [x] 从 3分钟以上的长音频（单个音频只能包含单说话人）克隆角色声音 20 | - [x] 从 3分钟以上的视频（单个视频只能包含单说话人）克隆角色声音 21 | - [x] 通过输入 bilibili视频链接（单个视频只能包含单说话人）克隆角色声音 22 | 23 | ### 目前支持声线转换和中日英三语TTS的角色 24 | - [x] 任意角色（只要你有角色的声音样本） 25 | （注意：声线转换只能在任意两个存在于模型中的说话人之间进行） 26 | 27 | 28 | 29 | 30 | ## 微调 31 | 若希望于本地机器进行训练，请参考[LOCAL.md](https://github.com/Plachtaa/VITS-fast-fine-tuning/blob/main/LOCAL.md)以进行。 32 | 另外，也可以选择使用 [Google Colab](https://colab.research.google.com/drive/1pn1xnFfdLK63gVXDwV4zCXfVeo8c-I-0?usp=sharing) 33 | 进行微调任务。 34 | ### 我需要花多长时间？ 35 | 1. 安装依赖 (10 min在Google Colab中) 36 | 2. 选择预训练模型，详细区别参见[Colab 笔记本页面](https://colab.research.google.com/drive/1pn1xnFfdLK63gVXDwV4zCXfVeo8c-I-0?usp=sharing)。 37 | 3. 上传你希望加入的其它角色声音，详细上传方式见[DATA.MD](https://github.com/Plachtaa/VITS-fast-fine-tuning/blob/main/DATA.MD) 38 | 4. 进行微调，根据选择的微调方式和样本数量不同，花费时长可能在20分钟到2小时不等。 39 | 40 | 微调结束后可以直接下载微调好的模型，日后在本地运行（不需要GPU） 41 | 42 | ## 本地运行和推理 43 | 0. 记得下载微调好的模型和config文件！ 44 | 1. 下载最新的Release包（在Github页面的右侧） 45 | 2. 把下载的模型和config文件放在 `inference`文件夹下, 其文件名分别为 `G_latest.pth` 和 `finetune_speaker.json`。 46 | 3. 一切准备就绪后，文件结构应该如下所示: 47 | ``` 48 | inference 49 | ├───inference.exe 50 | ├───... 51 | ├───finetune_speaker.json 52 | └───G_latest.pth 53 | ``` 54 | 4. 运行 `inference.exe`, 浏览器会自动弹出窗口, 注意其所在路径不能有中文字符或者空格. 55 | 5. 请注意，声线转换功能需要安装`ffmpeg`才能正常使用. 56 | 57 | ## 在MoeGoe使用 58 | 0. MoeGoe以及类似其它VITS推理UI使用的config格式略有不同，需要下载的文件为模型`G_latest.pth`和配置文件`moegoe_config.json` 59 | 1. 按照[MoeGoe](https://github.com/CjangCjengh/MoeGoe)页面的提示配置路径即可使用。 60 | 2. MoeGoe在输入句子时需要使用相应的语言标记包裹句子才能正常合成。（日语用[JA], 中文用[ZH], 英文用[EN]），例如： 61 | [JA]こんにちわ。[JA] 62 | [ZH]你好！[ZH] 63 | [EN]Hello![EN] 64 | 65 | ## 帮助 66 | 如果你在使用过程中遇到了任何问题，可以在[这里](https://github.com/Plachtaa/VITS-fast-fine-tuning/issues/new)开一个issue，或者加入Discord服务器寻求帮助：[Discord](https://discord.gg/TcrjDFvm5A)。 67 | -------------------------------------------------------------------------------- /VC_inference.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | import torch 4 | from torch import no_grad, LongTensor 5 | import argparse 6 | import commons 7 | from mel_processing import spectrogram_torch 8 | import utils 9 | from models import SynthesizerTrn 10 | import gradio as gr 11 | import librosa 12 | import webbrowser 13 | 14 | from text import text_to_sequence, _clean_text 15 | device = "cuda:0" if torch.cuda.is_available() else "cpu" 16 | import logging 17 | logging.getLogger("PIL").setLevel(logging.WARNING) 18 | logging.getLogger("urllib3").setLevel(logging.WARNING) 19 | logging.getLogger("markdown_it").setLevel(logging.WARNING) 20 | logging.getLogger("httpx").setLevel(logging.WARNING) 21 | logging.getLogger("asyncio").setLevel(logging.WARNING) 22 | 23 | language_marks = { 24 | "Japanese": "", 25 | "日本語": "[JA]", 26 | "简体中文": "[ZH]", 27 | "English": "[EN]", 28 | "Mix": "", 29 | } 30 | lang = ['日本語', '简体中文', 'English', 'Mix'] 31 | def get_text(text, hps, is_symbol): 32 | text_norm = text_to_sequence(text, hps.symbols, [] if is_symbol else hps.data.text_cleaners) 33 | if hps.data.add_blank: 34 | text_norm = commons.intersperse(text_norm, 0) 35 | text_norm = LongTensor(text_norm) 36 | return text_norm 37 | 38 | def create_tts_fn(model, hps, speaker_ids): 39 | def tts_fn(text, speaker, language, speed): 40 | if language is not None: 41 | text = language_marks[language] + text + language_marks[language] 42 | speaker_id = speaker_ids[speaker] 43 | stn_tst = get_text(text, hps, False) 44 | with no_grad(): 45 | x_tst = stn_tst.unsqueeze(0).to(device) 46 | x_tst_lengths = LongTensor([stn_tst.size(0)]).to(device) 47 | sid = LongTensor([speaker_id]).to(device) 48 | audio = model.infer(x_tst, x_tst_lengths, sid=sid, noise_scale=.667, noise_scale_w=0.8, 49 | length_scale=1.0 / speed)[0][0, 0].data.cpu().float().numpy() 50 | del stn_tst, x_tst, x_tst_lengths, sid 51 | return "Success", (hps.data.sampling_rate, audio) 52 | 53 | return tts_fn 54 | 55 | def create_vc_fn(model, hps, speaker_ids): 56 | def vc_fn(original_speaker, target_speaker, record_audio, upload_audio): 57 | input_audio = record_audio if record_audio is not None else upload_audio 58 | if input_audio is None: 59 | return "You need to record or upload an audio", None 60 | sampling_rate, audio = input_audio 61 | original_speaker_id = speaker_ids[original_speaker] 62 | target_speaker_id = speaker_ids[target_speaker] 63 | 64 | audio = (audio / np.iinfo(audio.dtype).max).astype(np.float32) 65 | if len(audio.shape) > 1: 66 | audio = librosa.to_mono(audio.transpose(1, 0)) 67 | if sampling_rate != hps.data.sampling_rate: 68 | audio = librosa.resample(audio, orig_sr=sampling_rate, target_sr=hps.data.sampling_rate) 69 | with no_grad(): 70 | y = torch.FloatTensor(audio) 71 | y = y / max(-y.min(), y.max()) / 0.99 72 | y = y.to(device) 73 | y = y.unsqueeze(0) 74 | spec = spectrogram_torch(y, hps.data.filter_length, 75 | hps.data.sampling_rate, hps.data.hop_length, hps.data.win_length, 76 | center=False).to(device) 77 | spec_lengths = LongTensor([spec.size(-1)]).to(device) 78 | sid_src = LongTensor([original_speaker_id]).to(device) 79 | sid_tgt = LongTensor([target_speaker_id]).to(device) 80 | audio = model.voice_conversion(spec, spec_lengths, sid_src=sid_src, sid_tgt=sid_tgt)[0][ 81 | 0, 0].data.cpu().float().numpy() 82 | del y, spec, spec_lengths, sid_src, sid_tgt 83 | return "Success", (hps.data.sampling_rate, audio) 84 | 85 | return vc_fn 86 | if __name__ == "__main__": 87 | parser = argparse.ArgumentParser() 88 | parser.add_argument("--model_dir", default="./G_latest.pth", help="directory to your fine-tuned model") 89 | parser.add_argument("--config_dir", default="./finetune_speaker.json", help="directory to your model config file") 90 | parser.add_argument("--share", default=False, help="make link public (used in colab)") 91 | 92 | args = parser.parse_args() 93 | hps = utils.get_hparams_from_file(args.config_dir) 94 | 95 | 96 | net_g = SynthesizerTrn( 97 | len(hps.symbols), 98 | hps.data.filter_length // 2 + 1, 99 | hps.train.segment_size // hps.data.hop_length, 100 | n_speakers=hps.data.n_speakers, 101 | **hps.model).to(device) 102 | _ = net_g.eval() 103 | 104 | _ = utils.load_checkpoint(args.model_dir, net_g, None) 105 | speaker_ids = hps.speakers 106 | speakers = list(hps.speakers.keys()) 107 | tts_fn = create_tts_fn(net_g, hps, speaker_ids) 108 | vc_fn = create_vc_fn(net_g, hps, speaker_ids) 109 | app = gr.Blocks() 110 | with app: 111 | with gr.Tab("Text-to-Speech"): 112 | with gr.Row(): 113 | with gr.Column(): 114 | textbox = gr.TextArea(label="Text", 115 | placeholder="Type your sentence here", 116 | value="こんにちわ。", elem_id=f"tts-input") 117 | # select character 118 | char_dropdown = gr.Dropdown(choices=speakers, value=speakers[0], label='character') 119 | language_dropdown = gr.Dropdown(choices=lang, value=lang[0], label='language') 120 | duration_slider = gr.Slider(minimum=0.1, maximum=5, value=1, step=0.1, 121 | label='速度 Speed') 122 | with gr.Column(): 123 | text_output = gr.Textbox(label="Message") 124 | audio_output = gr.Audio(label="Output Audio", elem_id="tts-audio") 125 | btn = gr.Button("Generate!") 126 | btn.click(tts_fn, 127 | inputs=[textbox, char_dropdown, language_dropdown, duration_slider,], 128 | outputs=[text_output, audio_output]) 129 | with gr.Tab("Voice Conversion"): 130 | gr.Markdown(""" 131 | 录制或上传声音，并选择要转换的音色。 132 | """) 133 | with gr.Column(): 134 | record_audio = gr.Audio(label="record your voice", source="microphone") 135 | upload_audio = gr.Audio(label="or upload audio here", source="upload") 136 | source_speaker = gr.Dropdown(choices=speakers, value=speakers[0], label="source speaker") 137 | target_speaker = gr.Dropdown(choices=speakers, value=speakers[0], label="target speaker") 138 | with gr.Column(): 139 | message_box = gr.Textbox(label="Message") 140 | converted_audio = gr.Audio(label='converted audio') 141 | btn = gr.Button("Convert!") 142 | btn.click(vc_fn, inputs=[source_speaker, target_speaker, record_audio, upload_audio], 143 | outputs=[message_box, converted_audio]) 144 | webbrowser.open("http://127.0.0.1:7860") 145 | app.launch(share=args.share) 146 | 147 | -------------------------------------------------------------------------------- /attentions.py: -------------------------------------------------------------------------------- 1 | import copy 2 | import math 3 | import numpy as np 4 | import torch 5 | from torch import nn 6 | from torch.nn import functional as F 7 | 8 | import commons 9 | import modules 10 | from modules import LayerNorm 11 | 12 | 13 | class Encoder(nn.Module): 14 | def __init__(self, hidden_channels, filter_channels, n_heads, n_layers, kernel_size=1, p_dropout=0., window_size=4, **kwargs): 15 | super().__init__() 16 | self.hidden_channels = hidden_channels 17 | self.filter_channels = filter_channels 18 | self.n_heads = n_heads 19 | self.n_layers = n_layers 20 | self.kernel_size = kernel_size 21 | self.p_dropout = p_dropout 22 | self.window_size = window_size 23 | 24 | self.drop = nn.Dropout(p_dropout) 25 | self.attn_layers = nn.ModuleList() 26 | self.norm_layers_1 = nn.ModuleList() 27 | self.ffn_layers = nn.ModuleList() 28 | self.norm_layers_2 = nn.ModuleList() 29 | for i in range(self.n_layers): 30 | self.attn_layers.append(MultiHeadAttention(hidden_channels, hidden_channels, n_heads, p_dropout=p_dropout, window_size=window_size)) 31 | self.norm_layers_1.append(LayerNorm(hidden_channels)) 32 | self.ffn_layers.append(FFN(hidden_channels, hidden_channels, filter_channels, kernel_size, p_dropout=p_dropout)) 33 | self.norm_layers_2.append(LayerNorm(hidden_channels)) 34 | 35 | def forward(self, x, x_mask): 36 | attn_mask = x_mask.unsqueeze(2) * x_mask.unsqueeze(-1) 37 | x = x * x_mask 38 | for i in range(self.n_layers): 39 | y = self.attn_layers[i](x, x, attn_mask) 40 | y = self.drop(y) 41 | x = self.norm_layers_1[i](x + y) 42 | 43 | y = self.ffn_layers[i](x, x_mask) 44 | y = self.drop(y) 45 | x = self.norm_layers_2[i](x + y) 46 | x = x * x_mask 47 | return x 48 | 49 | 50 | class Decoder(nn.Module): 51 | def __init__(self, hidden_channels, filter_channels, n_heads, n_layers, kernel_size=1, p_dropout=0., proximal_bias=False, proximal_init=True, **kwargs): 52 | super().__init__() 53 | self.hidden_channels = hidden_channels 54 | self.filter_channels = filter_channels 55 | self.n_heads = n_heads 56 | self.n_layers = n_layers 57 | self.kernel_size = kernel_size 58 | self.p_dropout = p_dropout 59 | self.proximal_bias = proximal_bias 60 | self.proximal_init = proximal_init 61 | 62 | self.drop = nn.Dropout(p_dropout) 63 | self.self_attn_layers = nn.ModuleList() 64 | self.norm_layers_0 = nn.ModuleList() 65 | self.encdec_attn_layers = nn.ModuleList() 66 | self.norm_layers_1 = nn.ModuleList() 67 | self.ffn_layers = nn.ModuleList() 68 | self.norm_layers_2 = nn.ModuleList() 69 | for i in range(self.n_layers): 70 | self.self_attn_layers.append(MultiHeadAttention(hidden_channels, hidden_channels, n_heads, p_dropout=p_dropout, proximal_bias=proximal_bias, proximal_init=proximal_init)) 71 | self.norm_layers_0.append(LayerNorm(hidden_channels)) 72 | self.encdec_attn_layers.append(MultiHeadAttention(hidden_channels, hidden_channels, n_heads, p_dropout=p_dropout)) 73 | self.norm_layers_1.append(LayerNorm(hidden_channels)) 74 | self.ffn_layers.append(FFN(hidden_channels, hidden_channels, filter_channels, kernel_size, p_dropout=p_dropout, causal=True)) 75 | self.norm_layers_2.append(LayerNorm(hidden_channels)) 76 | 77 | def forward(self, x, x_mask, h, h_mask): 78 | """ 79 | x: decoder input 80 | h: encoder output 81 | """ 82 | self_attn_mask = commons.subsequent_mask(x_mask.size(2)).to(device=x.device, dtype=x.dtype) 83 | encdec_attn_mask = h_mask.unsqueeze(2) * x_mask.unsqueeze(-1) 84 | x = x * x_mask 85 | for i in range(self.n_layers): 86 | y = self.self_attn_layers[i](x, x, self_attn_mask) 87 | y = self.drop(y) 88 | x = self.norm_layers_0[i](x + y) 89 | 90 | y = self.encdec_attn_layers[i](x, h, encdec_attn_mask) 91 | y = self.drop(y) 92 | x = self.norm_layers_1[i](x + y) 93 | 94 | y = self.ffn_layers[i](x, x_mask) 95 | y = self.drop(y) 96 | x = self.norm_layers_2[i](x + y) 97 | x = x * x_mask 98 | return x 99 | 100 | 101 | class MultiHeadAttention(nn.Module): 102 | def __init__(self, channels, out_channels, n_heads, p_dropout=0., window_size=None, heads_share=True, block_length=None, proximal_bias=False, proximal_init=False): 103 | super().__init__() 104 | assert channels % n_heads == 0 105 | 106 | self.channels = channels 107 | self.out_channels = out_channels 108 | self.n_heads = n_heads 109 | self.p_dropout = p_dropout 110 | self.window_size = window_size 111 | self.heads_share = heads_share 112 | self.block_length = block_length 113 | self.proximal_bias = proximal_bias 114 | self.proximal_init = proximal_init 115 | self.attn = None 116 | 117 | self.k_channels = channels // n_heads 118 | self.conv_q = nn.Conv1d(channels, channels, 1) 119 | self.conv_k = nn.Conv1d(channels, channels, 1) 120 | self.conv_v = nn.Conv1d(channels, channels, 1) 121 | self.conv_o = nn.Conv1d(channels, out_channels, 1) 122 | self.drop = nn.Dropout(p_dropout) 123 | 124 | if window_size is not None: 125 | n_heads_rel = 1 if heads_share else n_heads 126 | rel_stddev = self.k_channels**-0.5 127 | self.emb_rel_k = nn.Parameter(torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels) * rel_stddev) 128 | self.emb_rel_v = nn.Parameter(torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels) * rel_stddev) 129 | 130 | nn.init.xavier_uniform_(self.conv_q.weight) 131 | nn.init.xavier_uniform_(self.conv_k.weight) 132 | nn.init.xavier_uniform_(self.conv_v.weight) 133 | if proximal_init: 134 | with torch.no_grad(): 135 | self.conv_k.weight.copy_(self.conv_q.weight) 136 | self.conv_k.bias.copy_(self.conv_q.bias) 137 | 138 | def forward(self, x, c, attn_mask=None): 139 | q = self.conv_q(x) 140 | k = self.conv_k(c) 141 | v = self.conv_v(c) 142 | 143 | x, self.attn = self.attention(q, k, v, mask=attn_mask) 144 | 145 | x = self.conv_o(x) 146 | return x 147 | 148 | def attention(self, query, key, value, mask=None): 149 | # reshape [b, d, t] -> [b, n_h, t, d_k] 150 | b, d, t_s, t_t = (*key.size(), query.size(2)) 151 | query = query.view(b, self.n_heads, self.k_channels, t_t).transpose(2, 3) 152 | key = key.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3) 153 | value = value.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3) 154 | 155 | scores = torch.matmul(query / math.sqrt(self.k_channels), key.transpose(-2, -1)) 156 | if self.window_size is not None: 157 | assert t_s == t_t, "Relative attention is only available for self-attention." 158 | key_relative_embeddings = self._get_relative_embeddings(self.emb_rel_k, t_s) 159 | rel_logits = self._matmul_with_relative_keys(query /math.sqrt(self.k_channels), key_relative_embeddings) 160 | scores_local = self._relative_position_to_absolute_position(rel_logits) 161 | scores = scores + scores_local 162 | if self.proximal_bias: 163 | assert t_s == t_t, "Proximal bias is only available for self-attention." 164 | scores = scores + self._attention_bias_proximal(t_s).to(device=scores.device, dtype=scores.dtype) 165 | if mask is not None: 166 | scores = scores.masked_fill(mask == 0, -1e4) 167 | if self.block_length is not None: 168 | assert t_s == t_t, "Local attention is only available for self-attention." 169 | block_mask = torch.ones_like(scores).triu(-self.block_length).tril(self.block_length) 170 | scores = scores.masked_fill(block_mask == 0, -1e4) 171 | p_attn = F.softmax(scores, dim=-1) # [b, n_h, t_t, t_s] 172 | p_attn = self.drop(p_attn) 173 | output = torch.matmul(p_attn, value) 174 | if self.window_size is not None: 175 | relative_weights = self._absolute_position_to_relative_position(p_attn) 176 | value_relative_embeddings = self._get_relative_embeddings(self.emb_rel_v, t_s) 177 | output = output + self._matmul_with_relative_values(relative_weights, value_relative_embeddings) 178 | output = output.transpose(2, 3).contiguous().view(b, d, t_t) # [b, n_h, t_t, d_k] -> [b, d, t_t] 179 | return output, p_attn 180 | 181 | def _matmul_with_relative_values(self, x, y): 182 | """ 183 | x: [b, h, l, m] 184 | y: [h or 1, m, d] 185 | ret: [b, h, l, d] 186 | """ 187 | ret = torch.matmul(x, y.unsqueeze(0)) 188 | return ret 189 | 190 | def _matmul_with_relative_keys(self, x, y): 191 | """ 192 | x: [b, h, l, d] 193 | y: [h or 1, m, d] 194 | ret: [b, h, l, m] 195 | """ 196 | ret = torch.matmul(x, y.unsqueeze(0).transpose(-2, -1)) 197 | return ret 198 | 199 | def _get_relative_embeddings(self, relative_embeddings, length): 200 | max_relative_position = 2 * self.window_size + 1 201 | # Pad first before slice to avoid using cond ops. 202 | pad_length = max(length - (self.window_size + 1), 0) 203 | slice_start_position = max((self.window_size + 1) - length, 0) 204 | slice_end_position = slice_start_position + 2 * length - 1 205 | if pad_length > 0: 206 | padded_relative_embeddings = F.pad( 207 | relative_embeddings, 208 | commons.convert_pad_shape([[0, 0], [pad_length, pad_length], [0, 0]])) 209 | else: 210 | padded_relative_embeddings = relative_embeddings 211 | used_relative_embeddings = padded_relative_embeddings[:,slice_start_position:slice_end_position] 212 | return used_relative_embeddings 213 | 214 | def _relative_position_to_absolute_position(self, x): 215 | """ 216 | x: [b, h, l, 2*l-1] 217 | ret: [b, h, l, l] 218 | """ 219 | batch, heads, length, _ = x.size() 220 | # Concat columns of pad to shift from relative to absolute indexing. 221 | x = F.pad(x, commons.convert_pad_shape([[0,0],[0,0],[0,0],[0,1]])) 222 | 223 | # Concat extra elements so to add up to shape (len+1, 2*len-1). 224 | x_flat = x.view([batch, heads, length * 2 * length]) 225 | x_flat = F.pad(x_flat, commons.convert_pad_shape([[0,0],[0,0],[0,length-1]])) 226 | 227 | # Reshape and slice out the padded elements. 228 | x_final = x_flat.view([batch, heads, length+1, 2*length-1])[:, :, :length, length-1:] 229 | return x_final 230 | 231 | def _absolute_position_to_relative_position(self, x): 232 | """ 233 | x: [b, h, l, l] 234 | ret: [b, h, l, 2*l-1] 235 | """ 236 | batch, heads, length, _ = x.size() 237 | # padd along column 238 | x = F.pad(x, commons.convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, length-1]])) 239 | x_flat = x.view([batch, heads, length**2 + length*(length -1)]) 240 | # add 0's in the beginning that will skew the elements after reshape 241 | x_flat = F.pad(x_flat, commons.convert_pad_shape([[0, 0], [0, 0], [length, 0]])) 242 | x_final = x_flat.view([batch, heads, length, 2*length])[:,:,:,1:] 243 | return x_final 244 | 245 | def _attention_bias_proximal(self, length): 246 | """Bias for self-attention to encourage attention to close positions. 247 | Args: 248 | length: an integer scalar. 249 | Returns: 250 | a Tensor with shape [1, 1, length, length] 251 | """ 252 | r = torch.arange(length, dtype=torch.float32) 253 | diff = torch.unsqueeze(r, 0) - torch.unsqueeze(r, 1) 254 | return torch.unsqueeze(torch.unsqueeze(-torch.log1p(torch.abs(diff)), 0), 0) 255 | 256 | 257 | class FFN(nn.Module): 258 | def __init__(self, in_channels, out_channels, filter_channels, kernel_size, p_dropout=0., activation=None, causal=False): 259 | super().__init__() 260 | self.in_channels = in_channels 261 | self.out_channels = out_channels 262 | self.filter_channels = filter_channels 263 | self.kernel_size = kernel_size 264 | self.p_dropout = p_dropout 265 | self.activation = activation 266 | self.causal = causal 267 | 268 | if causal: 269 | self.padding = self._causal_padding 270 | else: 271 | self.padding = self._same_padding 272 | 273 | self.conv_1 = nn.Conv1d(in_channels, filter_channels, kernel_size) 274 | self.conv_2 = nn.Conv1d(filter_channels, out_channels, kernel_size) 275 | self.drop = nn.Dropout(p_dropout) 276 | 277 | def forward(self, x, x_mask): 278 | x = self.conv_1(self.padding(x * x_mask)) 279 | if self.activation == "gelu": 280 | x = x * torch.sigmoid(1.702 * x) 281 | else: 282 | x = torch.relu(x) 283 | x = self.drop(x) 284 | x = self.conv_2(self.padding(x * x_mask)) 285 | return x * x_mask 286 | 287 | def _causal_padding(self, x): 288 | if self.kernel_size == 1: 289 | return x 290 | pad_l = self.kernel_size - 1 291 | pad_r = 0 292 | padding = [[0, 0], [0, 0], [pad_l, pad_r]] 293 | x = F.pad(x, commons.convert_pad_shape(padding)) 294 | return x 295 | 296 | def _same_padding(self, x): 297 | if self.kernel_size == 1: 298 | return x 299 | pad_l = (self.kernel_size - 1) // 2 300 | pad_r = self.kernel_size // 2 301 | padding = [[0, 0], [0, 0], [pad_l, pad_r]] 302 | x = F.pad(x, commons.convert_pad_shape(padding)) 303 | return x 304 | -------------------------------------------------------------------------------- /cmd_inference.py: -------------------------------------------------------------------------------- 1 | """该模块用于生成VITS文件 2 | 使用方法 3 | 4 | python cmd_inference.py -m 模型路径 -c 配置文件路径 -o 输出文件路径 -l 输入的语言 -t 输入文本 -s 合成目标说话人名称 5 | 6 | 可选参数 7 | -ns 感情变化程度 8 | -nsw 音素发音长度 9 | -ls 整体语速 10 | -on 输出文件的名称 11 | 12 | """ 13 | """English version of this module, which is used to generate VITS files 14 | Instructions 15 | 16 | python cmd_inference.py -m model_path -c configuration_file_path -o output_file_path -l input_language -t input_text -s synthesize_target_speaker_name 17 | 18 | Optional parameters 19 | -ns degree of emotional change 20 | -nsw phoneme pronunciation length 21 | -ls overall speaking speed 22 | -on name of the output file 23 | """ 24 | 25 | from pathlib import Path 26 | import utils 27 | from models import SynthesizerTrn 28 | import torch 29 | from torch import no_grad, LongTensor 30 | import librosa 31 | from text import text_to_sequence, _clean_text 32 | import commons 33 | import scipy.io.wavfile as wavf 34 | import os 35 | 36 | device = "cuda:0" if torch.cuda.is_available() else "cpu" 37 | 38 | language_marks = { 39 | "Japanese": "", 40 | "日本語": "[JA]", 41 | "简体中文": "[ZH]", 42 | "English": "[EN]", 43 | "Mix": "", 44 | } 45 | 46 | 47 | def get_text(text, hps, is_symbol): 48 | text_norm = text_to_sequence(text, hps.symbols, [] if is_symbol else hps.data.text_cleaners) 49 | if hps.data.add_blank: 50 | text_norm = commons.intersperse(text_norm, 0) 51 | text_norm = LongTensor(text_norm) 52 | return text_norm 53 | 54 | 55 | 56 | if __name__ == "__main__": 57 | import argparse 58 | 59 | """ 60 | English description of some parameters: 61 | -s - speaker name, you should use name, not the number 62 | """ 63 | parser = argparse.ArgumentParser(description='vits inference') 64 | #必须参数 65 | parser.add_argument('-m', '--model_path', type=str, default="logs/44k/G_0.pth", help='模型路径') 66 | parser.add_argument('-c', '--config_path', type=str, default="configs/config.json", help='配置文件路径') 67 | parser.add_argument('-o', '--output_path', type=str, default="output/vits", help='输出文件路径') 68 | parser.add_argument('-l', '--language', type=str, default="日本語", help='输入的语言') 69 | parser.add_argument('-t', '--text', type=str, help='输入文本') 70 | parser.add_argument('-s', '--spk', type=str, help='合成目标说话人名称') 71 | #可选参数 72 | parser.add_argument('-on', '--output_name', type=str, default="output", help='输出文件的名称') 73 | parser.add_argument('-ns', '--noise_scale', type=float,default= .667,help='感情变化程度') 74 | parser.add_argument('-nsw', '--noise_scale_w', type=float,default=0.6, help='音素发音长度') 75 | parser.add_argument('-ls', '--length_scale', type=float,default=1, help='整体语速') 76 | 77 | args = parser.parse_args() 78 | 79 | model_path = args.model_path 80 | config_path = args.config_path 81 | output_dir = Path(args.output_path) 82 | output_dir.mkdir(parents=True, exist_ok=True) 83 | 84 | language = args.language 85 | text = args.text 86 | spk = args.spk 87 | noise_scale = args.noise_scale 88 | noise_scale_w = args.noise_scale_w 89 | length = args.length_scale 90 | output_name = args.output_name 91 | 92 | hps = utils.get_hparams_from_file(config_path) 93 | net_g = SynthesizerTrn( 94 | len(hps.symbols), 95 | hps.data.filter_length // 2 + 1, 96 | hps.train.segment_size // hps.data.hop_length, 97 | n_speakers=hps.data.n_speakers, 98 | **hps.model).to(device) 99 | _ = net_g.eval() 100 | _ = utils.load_checkpoint(model_path, net_g, None) 101 | 102 | speaker_ids = hps.speakers 103 | 104 | 105 | if language is not None: 106 | text = language_marks[language] + text + language_marks[language] 107 | speaker_id = speaker_ids[spk] 108 | stn_tst = get_text(text, hps, False) 109 | with no_grad(): 110 | x_tst = stn_tst.unsqueeze(0).to(device) 111 | x_tst_lengths = LongTensor([stn_tst.size(0)]).to(device) 112 | sid = LongTensor([speaker_id]).to(device) 113 | audio = net_g.infer(x_tst, x_tst_lengths, sid=sid, noise_scale=noise_scale, noise_scale_w=noise_scale_w, 114 | length_scale=1.0 / length)[0][0, 0].data.cpu().float().numpy() 115 | del stn_tst, x_tst, x_tst_lengths, sid 116 | 117 | wavf.write(str(output_dir)+"/"+output_name+".wav",hps.data.sampling_rate,audio) 118 | 119 | 120 | 121 | 122 | -------------------------------------------------------------------------------- /commons.py: -------------------------------------------------------------------------------- 1 | import math 2 | import numpy as np 3 | import torch 4 | from torch import nn 5 | from torch.nn import functional as F 6 | 7 | 8 | def init_weights(m, mean=0.0, std=0.01): 9 | classname = m.__class__.__name__ 10 | if classname.find("Conv") != -1: 11 | m.weight.data.normal_(mean, std) 12 | 13 | 14 | def get_padding(kernel_size, dilation=1): 15 | return int((kernel_size*dilation - dilation)/2) 16 | 17 | 18 | def convert_pad_shape(pad_shape): 19 | l = pad_shape[::-1] 20 | pad_shape = [item for sublist in l for item in sublist] 21 | return pad_shape 22 | 23 | 24 | def intersperse(lst, item): 25 | result = [item] * (len(lst) * 2 + 1) 26 | result[1::2] = lst 27 | return result 28 | 29 | 30 | def kl_divergence(m_p, logs_p, m_q, logs_q): 31 | """KL(P||Q)""" 32 | kl = (logs_q - logs_p) - 0.5 33 | kl += 0.5 * (torch.exp(2. * logs_p) + ((m_p - m_q)**2)) * torch.exp(-2. * logs_q) 34 | return kl 35 | 36 | 37 | def rand_gumbel(shape): 38 | """Sample from the Gumbel distribution, protect from overflows.""" 39 | uniform_samples = torch.rand(shape) * 0.99998 + 0.00001 40 | return -torch.log(-torch.log(uniform_samples)) 41 | 42 | 43 | def rand_gumbel_like(x): 44 | g = rand_gumbel(x.size()).to(dtype=x.dtype, device=x.device) 45 | return g 46 | 47 | 48 | def slice_segments(x, ids_str, segment_size=4): 49 | ret = torch.zeros_like(x[:, :, :segment_size]) 50 | for i in range(x.size(0)): 51 | idx_str = ids_str[i] 52 | idx_end = idx_str + segment_size 53 | try: 54 | ret[i] = x[i, :, idx_str:idx_end] 55 | except RuntimeError: 56 | print("?") 57 | return ret 58 | 59 | 60 | def rand_slice_segments(x, x_lengths=None, segment_size=4): 61 | b, d, t = x.size() 62 | if x_lengths is None: 63 | x_lengths = t 64 | ids_str_max = x_lengths - segment_size + 1 65 | ids_str = (torch.rand([b]).to(device=x.device) * ids_str_max).to(dtype=torch.long) 66 | ret = slice_segments(x, ids_str, segment_size) 67 | return ret, ids_str 68 | 69 | 70 | def get_timing_signal_1d( 71 | length, channels, min_timescale=1.0, max_timescale=1.0e4): 72 | position = torch.arange(length, dtype=torch.float) 73 | num_timescales = channels // 2 74 | log_timescale_increment = ( 75 | math.log(float(max_timescale) / float(min_timescale)) / 76 | (num_timescales - 1)) 77 | inv_timescales = min_timescale * torch.exp( 78 | torch.arange(num_timescales, dtype=torch.float) * -log_timescale_increment) 79 | scaled_time = position.unsqueeze(0) * inv_timescales.unsqueeze(1) 80 | signal = torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], 0) 81 | signal = F.pad(signal, [0, 0, 0, channels % 2]) 82 | signal = signal.view(1, channels, length) 83 | return signal 84 | 85 | 86 | def add_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4): 87 | b, channels, length = x.size() 88 | signal = get_timing_signal_1d(length, channels, min_timescale, max_timescale) 89 | return x + signal.to(dtype=x.dtype, device=x.device) 90 | 91 | 92 | def cat_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4, axis=1): 93 | b, channels, length = x.size() 94 | signal = get_timing_signal_1d(length, channels, min_timescale, max_timescale) 95 | return torch.cat([x, signal.to(dtype=x.dtype, device=x.device)], axis) 96 | 97 | 98 | def subsequent_mask(length): 99 | mask = torch.tril(torch.ones(length, length)).unsqueeze(0).unsqueeze(0) 100 | return mask 101 | 102 | 103 | @torch.jit.script 104 | def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels): 105 | n_channels_int = n_channels[0] 106 | in_act = input_a + input_b 107 | t_act = torch.tanh(in_act[:, :n_channels_int, :]) 108 | s_act = torch.sigmoid(in_act[:, n_channels_int:, :]) 109 | acts = t_act * s_act 110 | return acts 111 | 112 | 113 | def convert_pad_shape(pad_shape): 114 | l = pad_shape[::-1] 115 | pad_shape = [item for sublist in l for item in sublist] 116 | return pad_shape 117 | 118 | 119 | def shift_1d(x): 120 | x = F.pad(x, convert_pad_shape([[0, 0], [0, 0], [1, 0]]))[:, :, :-1] 121 | return x 122 | 123 | 124 | def sequence_mask(length, max_length=None): 125 | if max_length is None: 126 | max_length = length.max() 127 | x = torch.arange(max_length, dtype=length.dtype, device=length.device) 128 | return x.unsqueeze(0) < length.unsqueeze(1) 129 | 130 | 131 | def generate_path(duration, mask): 132 | """ 133 | duration: [b, 1, t_x] 134 | mask: [b, 1, t_y, t_x] 135 | """ 136 | device = duration.device 137 | 138 | b, _, t_y, t_x = mask.shape 139 | cum_duration = torch.cumsum(duration, -1) 140 | 141 | cum_duration_flat = cum_duration.view(b * t_x) 142 | path = sequence_mask(cum_duration_flat, t_y).to(mask.dtype) 143 | path = path.view(b, t_x, t_y) 144 | path = path - F.pad(path, convert_pad_shape([[0, 0], [1, 0], [0, 0]]))[:, :-1] 145 | path = path.unsqueeze(1).transpose(2,3) * mask 146 | return path 147 | 148 | 149 | def clip_grad_value_(parameters, clip_value, norm_type=2): 150 | if isinstance(parameters, torch.Tensor): 151 | parameters = [parameters] 152 | parameters = list(filter(lambda p: p.grad is not None, parameters)) 153 | norm_type = float(norm_type) 154 | if clip_value is not None: 155 | clip_value = float(clip_value) 156 | 157 | total_norm = 0 158 | for p in parameters: 159 | param_norm = p.grad.data.norm(norm_type) 160 | total_norm += param_norm.item() ** norm_type 161 | if clip_value is not None: 162 | p.grad.data.clamp_(min=-clip_value, max=clip_value) 163 | total_norm = total_norm ** (1. / norm_type) 164 | return total_norm 165 | -------------------------------------------------------------------------------- /configs/modified_finetune_speaker.json: -------------------------------------------------------------------------------- 1 | { 2 | "train": { 3 | "log_interval": 10, 4 | "eval_interval": 100, 5 | "seed": 1234, 6 | "epochs": 10000, 7 | "learning_rate": 0.0002, 8 | "betas": [ 9 | 0.8, 10 | 0.99 11 | ], 12 | "eps": 1e-09, 13 | "batch_size": 16, 14 | "fp16_run": true, 15 | "lr_decay": 0.999875, 16 | "segment_size": 8192, 17 | "init_lr_ratio": 1, 18 | "warmup_epochs": 0, 19 | "c_mel": 45, 20 | "c_kl": 1.0 21 | }, 22 | "data": { 23 | "training_files": "final_annotation_train.txt", 24 | "validation_files": "final_annotation_val.txt", 25 | "text_cleaners": [ 26 | "chinese_cleaners" 27 | ], 28 | "max_wav_value": 32768.0, 29 | "sampling_rate": 22050, 30 | "filter_length": 1024, 31 | "hop_length": 256, 32 | "win_length": 1024, 33 | "n_mel_channels": 80, 34 | "mel_fmin": 0.0, 35 | "mel_fmax": null, 36 | "add_blank": true, 37 | "n_speakers": 2, 38 | "cleaned_text": true 39 | }, 40 | "model": { 41 | "inter_channels": 192, 42 | "hidden_channels": 192, 43 | "filter_channels": 768, 44 | "n_heads": 2, 45 | "n_layers": 6, 46 | "kernel_size": 3, 47 | "p_dropout": 0.1, 48 | "resblock": "1", 49 | "resblock_kernel_sizes": [ 50 | 3, 51 | 7, 52 | 11 53 | ], 54 | "resblock_dilation_sizes": [ 55 | [ 56 | 1, 57 | 3, 58 | 5 59 | ], 60 | [ 61 | 1, 62 | 3, 63 | 5 64 | ], 65 | [ 66 | 1, 67 | 3, 68 | 5 69 | ] 70 | ], 71 | "upsample_rates": [ 72 | 8, 73 | 8, 74 | 2, 75 | 2 76 | ], 77 | "upsample_initial_channel": 512, 78 | "upsample_kernel_sizes": [ 79 | 16, 80 | 16, 81 | 4, 82 | 4 83 | ], 84 | "n_layers_q": 3, 85 | "use_spectral_norm": false, 86 | "gin_channels": 256 87 | }, 88 | "symbols": [ 89 | "_", 90 | "\uff1b", 91 | "\uff1a", 92 | "\uff0c", 93 | "\u3002", 94 | "\uff01", 95 | "\uff1f", 96 | "-", 97 | "\u201c", 98 | "\u201d", 99 | "\u300a", 100 | "\u300b", 101 | "\u3001", 102 | "\uff08", 103 | "\uff09", 104 | "\u2026", 105 | "\u2014", 106 | " ", 107 | "A", 108 | "B", 109 | "C", 110 | "D", 111 | "E", 112 | "F", 113 | "G", 114 | "H", 115 | "I", 116 | "J", 117 | "K", 118 | "L", 119 | "M", 120 | "N", 121 | "O", 122 | "P", 123 | "Q", 124 | "R", 125 | "S", 126 | "T", 127 | "U", 128 | "V", 129 | "W", 130 | "X", 131 | "Y", 132 | "Z", 133 | "a", 134 | "b", 135 | "c", 136 | "d", 137 | "e", 138 | "f", 139 | "g", 140 | "h", 141 | "i", 142 | "j", 143 | "k", 144 | "l", 145 | "m", 146 | "n", 147 | "o", 148 | "p", 149 | "q", 150 | "r", 151 | "s", 152 | "t", 153 | "u", 154 | "v", 155 | "w", 156 | "x", 157 | "y", 158 | "z", 159 | "1", 160 | "2", 161 | "3", 162 | "4", 163 | "5", 164 | "0", 165 | "\uff22", 166 | "\uff30" 167 | ], 168 | "speakers": { 169 | "dingzhen": 0, 170 | "taffy": 1 171 | } 172 | } -------------------------------------------------------------------------------- /configs/uma_trilingual.json: -------------------------------------------------------------------------------- 1 | { 2 | "train": { 3 | "log_interval": 200, 4 | "eval_interval": 1000, 5 | "seed": 1234, 6 | "epochs": 10000, 7 | "learning_rate": 2e-4, 8 | "betas": [0.8, 0.99], 9 | "eps": 1e-9, 10 | "batch_size": 16, 11 | "fp16_run": true, 12 | "lr_decay": 0.999875, 13 | "segment_size": 8192, 14 | "init_lr_ratio": 1, 15 | "warmup_epochs": 0, 16 | "c_mel": 45, 17 | "c_kl": 1.0 18 | }, 19 | "data": { 20 | "training_files":"../CH_JA_EN_mix_voice/clipped_3_vits_trilingual_annotations.train.txt.cleaned", 21 | "validation_files":"../CH_JA_EN_mix_voice/clipped_3_vits_trilingual_annotations.val.txt.cleaned", 22 | "text_cleaners":["cjke_cleaners2"], 23 | "max_wav_value": 32768.0, 24 | "sampling_rate": 22050, 25 | "filter_length": 1024, 26 | "hop_length": 256, 27 | "win_length": 1024, 28 | "n_mel_channels": 80, 29 | "mel_fmin": 0.0, 30 | "mel_fmax": null, 31 | "add_blank": true, 32 | "n_speakers": 999, 33 | "cleaned_text": true 34 | }, 35 | "model": { 36 | "inter_channels": 192, 37 | "hidden_channels": 192, 38 | "filter_channels": 768, 39 | "n_heads": 2, 40 | "n_layers": 6, 41 | "kernel_size": 3, 42 | "p_dropout": 0.1, 43 | "resblock": "1", 44 | "resblock_kernel_sizes": [3,7,11], 45 | "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]], 46 | "upsample_rates": [8,8,2,2], 47 | "upsample_initial_channel": 512, 48 | "upsample_kernel_sizes": [16,16,4,4], 49 | "n_layers_q": 3, 50 | "use_spectral_norm": false, 51 | "gin_channels": 256 52 | }, 53 | "symbols": ["_", ",", ".", "!", "?", "-", "~", "\u2026", "N", "Q", "a", "b", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "s", "t", "u", "v", "w", "x", "y", "z", "\u0251", "\u00e6", "\u0283", "\u0291", "\u00e7", "\u026f", "\u026a", "\u0254", "\u025b", "\u0279", "\u00f0", "\u0259", "\u026b", "\u0265", "\u0278", "\u028a", "\u027e", "\u0292", "\u03b8", "\u03b2", "\u014b", "\u0266", "\u207c", "\u02b0", "`", "^", "#", "*", "=", "\u02c8", "\u02cc", "\u2192", "\u2193", "\u2191", " "] 54 | } -------------------------------------------------------------------------------- /data_utils.py: -------------------------------------------------------------------------------- 1 | import time 2 | import os 3 | import random 4 | import numpy as np 5 | import torch 6 | import torch.utils.data 7 | import torchaudio 8 | 9 | import commons 10 | from mel_processing import spectrogram_torch 11 | from utils import load_wav_to_torch, load_filepaths_and_text 12 | from text import text_to_sequence, cleaned_text_to_sequence 13 | """Multi speaker version""" 14 | 15 | 16 | class TextAudioSpeakerLoader(torch.utils.data.Dataset): 17 | """ 18 | 1) loads audio, speaker_id, text pairs 19 | 2) normalizes text and converts them to sequences of integers 20 | 3) computes spectrograms from audio files. 21 | """ 22 | 23 | def __init__(self, audiopaths_sid_text, hparams, symbols): 24 | self.audiopaths_sid_text = load_filepaths_and_text(audiopaths_sid_text) 25 | self.text_cleaners = hparams.text_cleaners 26 | self.max_wav_value = hparams.max_wav_value 27 | self.sampling_rate = hparams.sampling_rate 28 | self.filter_length = hparams.filter_length 29 | self.hop_length = hparams.hop_length 30 | self.win_length = hparams.win_length 31 | self.sampling_rate = hparams.sampling_rate 32 | 33 | self.cleaned_text = getattr(hparams, "cleaned_text", False) 34 | 35 | self.add_blank = hparams.add_blank 36 | self.min_text_len = getattr(hparams, "min_text_len", 1) 37 | self.max_text_len = getattr(hparams, "max_text_len", 190) 38 | self.symbols = symbols 39 | 40 | random.seed(1234) 41 | random.shuffle(self.audiopaths_sid_text) 42 | self._filter() 43 | 44 | def _filter(self): 45 | """ 46 | Filter text & store spec lengths 47 | """ 48 | # Store spectrogram lengths for Bucketing 49 | # wav_length ~= file_size / (wav_channels * Bytes per dim) = file_size / (1 * 2) 50 | # spec_length = wav_length // hop_length 51 | 52 | audiopaths_sid_text_new = [] 53 | lengths = [] 54 | for audiopath, sid, text in self.audiopaths_sid_text: 55 | # audiopath = "./user_voice/" + audiopath 56 | 57 | if self.min_text_len <= len(text) and len(text) <= self.max_text_len: 58 | audiopaths_sid_text_new.append([audiopath, sid, text]) 59 | lengths.append(os.path.getsize(audiopath) // (2 * self.hop_length)) 60 | self.audiopaths_sid_text = audiopaths_sid_text_new 61 | self.lengths = lengths 62 | 63 | def get_audio_text_speaker_pair(self, audiopath_sid_text): 64 | # separate filename, speaker_id and text 65 | audiopath, sid, text = audiopath_sid_text[0], audiopath_sid_text[1], audiopath_sid_text[2] 66 | text = self.get_text(text) 67 | spec, wav = self.get_audio(audiopath) 68 | sid = self.get_sid(sid) 69 | return (text, spec, wav, sid) 70 | 71 | def get_audio(self, filename): 72 | # audio, sampling_rate = load_wav_to_torch(filename) 73 | # if sampling_rate != self.sampling_rate: 74 | # raise ValueError("{} {} SR doesn't match target {} SR".format( 75 | # sampling_rate, self.sampling_rate)) 76 | # audio_norm = audio / self.max_wav_value if audio.max() > 10 else audio 77 | # audio_norm = audio_norm.unsqueeze(0) 78 | audio_norm, sampling_rate = torchaudio.load(filename, frame_offset=0, num_frames=-1, normalize=True, channels_first=True) 79 | # spec_filename = filename.replace(".wav", ".spec.pt") 80 | # if os.path.exists(spec_filename): 81 | # spec = torch.load(spec_filename) 82 | # else: 83 | # try: 84 | spec = spectrogram_torch(audio_norm, self.filter_length, 85 | self.sampling_rate, self.hop_length, self.win_length, 86 | center=False) 87 | spec = spec.squeeze(0) 88 | # except NotImplementedError: 89 | # print("?") 90 | # spec = torch.squeeze(spec, 0) 91 | # torch.save(spec, spec_filename) 92 | return spec, audio_norm 93 | 94 | def get_text(self, text): 95 | if self.cleaned_text: 96 | text_norm = cleaned_text_to_sequence(text, self.symbols) 97 | else: 98 | text_norm = text_to_sequence(text, self.text_cleaners) 99 | if self.add_blank: 100 | text_norm = commons.intersperse(text_norm, 0) 101 | text_norm = torch.LongTensor(text_norm) 102 | return text_norm 103 | 104 | def get_sid(self, sid): 105 | sid = torch.LongTensor([int(sid)]) 106 | return sid 107 | 108 | def __getitem__(self, index): 109 | return self.get_audio_text_speaker_pair(self.audiopaths_sid_text[index]) 110 | 111 | def __len__(self): 112 | return len(self.audiopaths_sid_text) 113 | 114 | 115 | class TextAudioSpeakerCollate(): 116 | """ Zero-pads model inputs and targets 117 | """ 118 | 119 | def __init__(self, return_ids=False): 120 | self.return_ids = return_ids 121 | 122 | def __call__(self, batch): 123 | """Collate's training batch from normalized text, audio and speaker identities 124 | PARAMS 125 | ------ 126 | batch: [text_normalized, spec_normalized, wav_normalized, sid] 127 | """ 128 | # Right zero-pad all one-hot text sequences to max input length 129 | _, ids_sorted_decreasing = torch.sort( 130 | torch.LongTensor([x[1].size(1) for x in batch]), 131 | dim=0, descending=True) 132 | 133 | max_text_len = max([len(x[0]) for x in batch]) 134 | max_spec_len = max([x[1].size(1) for x in batch]) 135 | max_wav_len = max([x[2].size(1) for x in batch]) 136 | 137 | text_lengths = torch.LongTensor(len(batch)) 138 | spec_lengths = torch.LongTensor(len(batch)) 139 | wav_lengths = torch.LongTensor(len(batch)) 140 | sid = torch.LongTensor(len(batch)) 141 | 142 | text_padded = torch.LongTensor(len(batch), max_text_len) 143 | spec_padded = torch.FloatTensor(len(batch), batch[0][1].size(0), max_spec_len) 144 | wav_padded = torch.FloatTensor(len(batch), 1, max_wav_len) 145 | text_padded.zero_() 146 | spec_padded.zero_() 147 | wav_padded.zero_() 148 | for i in range(len(ids_sorted_decreasing)): 149 | row = batch[ids_sorted_decreasing[i]] 150 | 151 | text = row[0] 152 | text_padded[i, :text.size(0)] = text 153 | text_lengths[i] = text.size(0) 154 | 155 | spec = row[1] 156 | spec_padded[i, :, :spec.size(1)] = spec 157 | spec_lengths[i] = spec.size(1) 158 | 159 | wav = row[2] 160 | wav_padded[i, :, :wav.size(1)] = wav 161 | wav_lengths[i] = wav.size(1) 162 | 163 | sid[i] = row[3] 164 | 165 | if self.return_ids: 166 | return text_padded, text_lengths, spec_padded, spec_lengths, wav_padded, wav_lengths, sid, ids_sorted_decreasing 167 | return text_padded, text_lengths, spec_padded, spec_lengths, wav_padded, wav_lengths, sid 168 | 169 | 170 | class DistributedBucketSampler(torch.utils.data.distributed.DistributedSampler): 171 | """ 172 | Maintain similar input lengths in a batch. 173 | Length groups are specified by boundaries. 174 | Ex) boundaries = [b1, b2, b3] -> any batch is included either {x | b1 < length(x) <=b2} or {x | b2 < length(x) <= b3}. 175 | 176 | It removes samples which are not included in the boundaries. 177 | Ex) boundaries = [b1, b2, b3] -> any x s.t. length(x) <= b1 or length(x) > b3 are discarded. 178 | """ 179 | 180 | def __init__(self, dataset, batch_size, boundaries, num_replicas=None, rank=None, shuffle=True): 181 | super().__init__(dataset, num_replicas=num_replicas, rank=rank, shuffle=shuffle) 182 | self.lengths = dataset.lengths 183 | self.batch_size = batch_size 184 | self.boundaries = boundaries 185 | 186 | self.buckets, self.num_samples_per_bucket = self._create_buckets() 187 | self.total_size = sum(self.num_samples_per_bucket) 188 | self.num_samples = self.total_size // self.num_replicas 189 | 190 | def _create_buckets(self): 191 | buckets = [[] for _ in range(len(self.boundaries) - 1)] 192 | for i in range(len(self.lengths)): 193 | length = self.lengths[i] 194 | idx_bucket = self._bisect(length) 195 | if idx_bucket != -1: 196 | buckets[idx_bucket].append(i) 197 | 198 | try: 199 | for i in range(len(buckets) - 1, 0, -1): 200 | if len(buckets[i]) == 0: 201 | buckets.pop(i) 202 | self.boundaries.pop(i + 1) 203 | assert all(len(bucket) > 0 for bucket in buckets) 204 | # When one bucket is not traversed 205 | except Exception as e: 206 | print('Bucket warning ', e) 207 | for i in range(len(buckets) - 1, -1, -1): 208 | if len(buckets[i]) == 0: 209 | buckets.pop(i) 210 | self.boundaries.pop(i + 1) 211 | 212 | num_samples_per_bucket = [] 213 | for i in range(len(buckets)): 214 | len_bucket = len(buckets[i]) 215 | total_batch_size = self.num_replicas * self.batch_size 216 | rem = (total_batch_size - (len_bucket % total_batch_size)) % total_batch_size 217 | num_samples_per_bucket.append(len_bucket + rem) 218 | return buckets, num_samples_per_bucket 219 | 220 | def __iter__(self): 221 | # deterministically shuffle based on epoch 222 | g = torch.Generator() 223 | g.manual_seed(self.epoch) 224 | 225 | indices = [] 226 | if self.shuffle: 227 | for bucket in self.buckets: 228 | indices.append(torch.randperm(len(bucket), generator=g).tolist()) 229 | else: 230 | for bucket in self.buckets: 231 | indices.append(list(range(len(bucket)))) 232 | 233 | batches = [] 234 | for i in range(len(self.buckets)): 235 | bucket = self.buckets[i] 236 | len_bucket = len(bucket) 237 | ids_bucket = indices[i] 238 | num_samples_bucket = self.num_samples_per_bucket[i] 239 | 240 | # add extra samples to make it evenly divisible 241 | rem = num_samples_bucket - len_bucket 242 | ids_bucket = ids_bucket + ids_bucket * (rem // len_bucket) + ids_bucket[:(rem % len_bucket)] 243 | 244 | # subsample 245 | ids_bucket = ids_bucket[self.rank::self.num_replicas] 246 | 247 | # batching 248 | for j in range(len(ids_bucket) // self.batch_size): 249 | batch = [bucket[idx] for idx in ids_bucket[j * self.batch_size:(j + 1) * self.batch_size]] 250 | batches.append(batch) 251 | 252 | if self.shuffle: 253 | batch_ids = torch.randperm(len(batches), generator=g).tolist() 254 | batches = [batches[i] for i in batch_ids] 255 | self.batches = batches 256 | 257 | assert len(self.batches) * self.batch_size == self.num_samples 258 | return iter(self.batches) 259 | 260 | def _bisect(self, x, lo=0, hi=None): 261 | if hi is None: 262 | hi = len(self.boundaries) - 1 263 | 264 | if hi > lo: 265 | mid = (hi + lo) // 2 266 | if self.boundaries[mid] < x and x <= self.boundaries[mid + 1]: 267 | return mid 268 | elif x <= self.boundaries[mid]: 269 | return self._bisect(x, lo, mid) 270 | else: 271 | return self._bisect(x, mid + 1, hi) 272 | else: 273 | return -1 274 | 275 | def __len__(self): 276 | return self.num_samples // self.batch_size 277 | -------------------------------------------------------------------------------- /losses.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.nn import functional as F 3 | 4 | import commons 5 | 6 | 7 | def feature_loss(fmap_r, fmap_g): 8 | loss = 0 9 | for dr, dg in zip(fmap_r, fmap_g): 10 | for rl, gl in zip(dr, dg): 11 | rl = rl.float().detach() 12 | gl = gl.float() 13 | loss += torch.mean(torch.abs(rl - gl)) 14 | 15 | return loss * 2 16 | 17 | 18 | def discriminator_loss(disc_real_outputs, disc_generated_outputs): 19 | loss = 0 20 | r_losses = [] 21 | g_losses = [] 22 | for dr, dg in zip(disc_real_outputs, disc_generated_outputs): 23 | dr = dr.float() 24 | dg = dg.float() 25 | r_loss = torch.mean((1-dr)**2) 26 | g_loss = torch.mean(dg**2) 27 | loss += (r_loss + g_loss) 28 | r_losses.append(r_loss.item()) 29 | g_losses.append(g_loss.item()) 30 | 31 | return loss, r_losses, g_losses 32 | 33 | 34 | def generator_loss(disc_outputs): 35 | loss = 0 36 | gen_losses = [] 37 | for dg in disc_outputs: 38 | dg = dg.float() 39 | l = torch.mean((1-dg)**2) 40 | gen_losses.append(l) 41 | loss += l 42 | 43 | return loss, gen_losses 44 | 45 | 46 | def kl_loss(z_p, logs_q, m_p, logs_p, z_mask): 47 | """ 48 | z_p, logs_q: [b, h, t_t] 49 | m_p, logs_p: [b, h, t_t] 50 | """ 51 | z_p = z_p.float() 52 | logs_q = logs_q.float() 53 | m_p = m_p.float() 54 | logs_p = logs_p.float() 55 | z_mask = z_mask.float() 56 | 57 | kl = logs_p - logs_q - 0.5 58 | kl += 0.5 * ((z_p - m_p)**2) * torch.exp(-2. * logs_p) 59 | kl = torch.sum(kl * z_mask) 60 | l = kl / torch.sum(z_mask) 61 | return l 62 | -------------------------------------------------------------------------------- /mel_processing.py: -------------------------------------------------------------------------------- 1 | import math 2 | import os 3 | import random 4 | import torch 5 | from torch import nn 6 | import torch.nn.functional as F 7 | import torch.utils.data 8 | import numpy as np 9 | import librosa 10 | import librosa.util as librosa_util 11 | from librosa.util import normalize, pad_center, tiny 12 | from scipy.signal import get_window 13 | from scipy.io.wavfile import read 14 | from librosa.filters import mel as librosa_mel_fn 15 | 16 | MAX_WAV_VALUE = 32768.0 17 | 18 | 19 | def dynamic_range_compression_torch(x, C=1, clip_val=1e-5): 20 | """ 21 | PARAMS 22 | ------ 23 | C: compression factor 24 | """ 25 | return torch.log(torch.clamp(x, min=clip_val) * C) 26 | 27 | 28 | def dynamic_range_decompression_torch(x, C=1): 29 | """ 30 | PARAMS 31 | ------ 32 | C: compression factor used to compress 33 | """ 34 | return torch.exp(x) / C 35 | 36 | 37 | def spectral_normalize_torch(magnitudes): 38 | output = dynamic_range_compression_torch(magnitudes) 39 | return output 40 | 41 | 42 | def spectral_de_normalize_torch(magnitudes): 43 | output = dynamic_range_decompression_torch(magnitudes) 44 | return output 45 | 46 | 47 | mel_basis = {} 48 | hann_window = {} 49 | 50 | 51 | def spectrogram_torch(y, n_fft, sampling_rate, hop_size, win_size, center=False): 52 | if torch.min(y) < -1.: 53 | print('min value is ', torch.min(y)) 54 | if torch.max(y) > 1.: 55 | print('max value is ', torch.max(y)) 56 | 57 | global hann_window 58 | dtype_device = str(y.dtype) + '_' + str(y.device) 59 | wnsize_dtype_device = str(win_size) + '_' + dtype_device 60 | if wnsize_dtype_device not in hann_window: 61 | hann_window[wnsize_dtype_device] = torch.hann_window(win_size).to(dtype=y.dtype, device=y.device) 62 | 63 | y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect') 64 | y = y.squeeze(1) 65 | 66 | spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[wnsize_dtype_device], 67 | center=center, pad_mode='reflect', normalized=False, onesided=True, return_complex=False) 68 | 69 | spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6) 70 | return spec 71 | 72 | 73 | def spec_to_mel_torch(spec, n_fft, num_mels, sampling_rate, fmin, fmax): 74 | global mel_basis 75 | dtype_device = str(spec.dtype) + '_' + str(spec.device) 76 | fmax_dtype_device = str(fmax) + '_' + dtype_device 77 | if fmax_dtype_device not in mel_basis: 78 | mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax) 79 | mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to(dtype=spec.dtype, device=spec.device) 80 | spec = torch.matmul(mel_basis[fmax_dtype_device], spec) 81 | spec = spectral_normalize_torch(spec) 82 | return spec 83 | 84 | 85 | def mel_spectrogram_torch(y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False): 86 | if torch.min(y) < -1.: 87 | print('min value is ', torch.min(y)) 88 | if torch.max(y) > 1.: 89 | print('max value is ', torch.max(y)) 90 | 91 | global mel_basis, hann_window 92 | dtype_device = str(y.dtype) + '_' + str(y.device) 93 | fmax_dtype_device = str(fmax) + '_' + dtype_device 94 | wnsize_dtype_device = str(win_size) + '_' + dtype_device 95 | if fmax_dtype_device not in mel_basis: 96 | mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax) 97 | mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to(dtype=y.dtype, device=y.device) 98 | if wnsize_dtype_device not in hann_window: 99 | hann_window[wnsize_dtype_device] = torch.hann_window(win_size).to(dtype=y.dtype, device=y.device) 100 | 101 | y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect') 102 | y = y.squeeze(1) 103 | 104 | spec = torch.stft(y.float(), n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[wnsize_dtype_device], 105 | center=center, pad_mode='reflect', normalized=False, onesided=True, return_complex=False) 106 | 107 | spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6) 108 | 109 | spec = torch.matmul(mel_basis[fmax_dtype_device], spec) 110 | spec = spectral_normalize_torch(spec) 111 | 112 | return spec 113 | -------------------------------------------------------------------------------- /modules.py: -------------------------------------------------------------------------------- 1 | import copy 2 | import math 3 | import numpy as np 4 | import scipy 5 | import torch 6 | from torch import nn 7 | from torch.nn import functional as F 8 | 9 | from torch.nn import Conv1d, ConvTranspose1d, AvgPool1d, Conv2d 10 | from torch.nn.utils import weight_norm, remove_weight_norm 11 | 12 | import commons 13 | from commons import init_weights, get_padding 14 | from transforms import piecewise_rational_quadratic_transform 15 | 16 | 17 | LRELU_SLOPE = 0.1 18 | 19 | 20 | class LayerNorm(nn.Module): 21 | def __init__(self, channels, eps=1e-5): 22 | super().__init__() 23 | self.channels = channels 24 | self.eps = eps 25 | 26 | self.gamma = nn.Parameter(torch.ones(channels)) 27 | self.beta = nn.Parameter(torch.zeros(channels)) 28 | 29 | def forward(self, x): 30 | x = x.transpose(1, -1) 31 | x = F.layer_norm(x, (self.channels,), self.gamma, self.beta, self.eps) 32 | return x.transpose(1, -1) 33 | 34 | 35 | class ConvReluNorm(nn.Module): 36 | def __init__(self, in_channels, hidden_channels, out_channels, kernel_size, n_layers, p_dropout): 37 | super().__init__() 38 | self.in_channels = in_channels 39 | self.hidden_channels = hidden_channels 40 | self.out_channels = out_channels 41 | self.kernel_size = kernel_size 42 | self.n_layers = n_layers 43 | self.p_dropout = p_dropout 44 | assert n_layers > 1, "Number of layers should be larger than 0." 45 | 46 | self.conv_layers = nn.ModuleList() 47 | self.norm_layers = nn.ModuleList() 48 | self.conv_layers.append(nn.Conv1d(in_channels, hidden_channels, kernel_size, padding=kernel_size//2)) 49 | self.norm_layers.append(LayerNorm(hidden_channels)) 50 | self.relu_drop = nn.Sequential( 51 | nn.ReLU(), 52 | nn.Dropout(p_dropout)) 53 | for _ in range(n_layers-1): 54 | self.conv_layers.append(nn.Conv1d(hidden_channels, hidden_channels, kernel_size, padding=kernel_size//2)) 55 | self.norm_layers.append(LayerNorm(hidden_channels)) 56 | self.proj = nn.Conv1d(hidden_channels, out_channels, 1) 57 | self.proj.weight.data.zero_() 58 | self.proj.bias.data.zero_() 59 | 60 | def forward(self, x, x_mask): 61 | x_org = x 62 | for i in range(self.n_layers): 63 | x = self.conv_layers[i](x * x_mask) 64 | x = self.norm_layers[i](x) 65 | x = self.relu_drop(x) 66 | x = x_org + self.proj(x) 67 | return x * x_mask 68 | 69 | 70 | class DDSConv(nn.Module): 71 | """ 72 | Dilated and Depth-Separable Convolution 73 | """ 74 | def __init__(self, channels, kernel_size, n_layers, p_dropout=0.): 75 | super().__init__() 76 | self.channels = channels 77 | self.kernel_size = kernel_size 78 | self.n_layers = n_layers 79 | self.p_dropout = p_dropout 80 | 81 | self.drop = nn.Dropout(p_dropout) 82 | self.convs_sep = nn.ModuleList() 83 | self.convs_1x1 = nn.ModuleList() 84 | self.norms_1 = nn.ModuleList() 85 | self.norms_2 = nn.ModuleList() 86 | for i in range(n_layers): 87 | dilation = kernel_size ** i 88 | padding = (kernel_size * dilation - dilation) // 2 89 | self.convs_sep.append(nn.Conv1d(channels, channels, kernel_size, 90 | groups=channels, dilation=dilation, padding=padding 91 | )) 92 | self.convs_1x1.append(nn.Conv1d(channels, channels, 1)) 93 | self.norms_1.append(LayerNorm(channels)) 94 | self.norms_2.append(LayerNorm(channels)) 95 | 96 | def forward(self, x, x_mask, g=None): 97 | if g is not None: 98 | x = x + g 99 | for i in range(self.n_layers): 100 | y = self.convs_sep[i](x * x_mask) 101 | y = self.norms_1[i](y) 102 | y = F.gelu(y) 103 | y = self.convs_1x1[i](y) 104 | y = self.norms_2[i](y) 105 | y = F.gelu(y) 106 | y = self.drop(y) 107 | x = x + y 108 | return x * x_mask 109 | 110 | 111 | class WN(torch.nn.Module): 112 | def __init__(self, hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=0, p_dropout=0): 113 | super(WN, self).__init__() 114 | assert(kernel_size % 2 == 1) 115 | self.hidden_channels =hidden_channels 116 | self.kernel_size = kernel_size, 117 | self.dilation_rate = dilation_rate 118 | self.n_layers = n_layers 119 | self.gin_channels = gin_channels 120 | self.p_dropout = p_dropout 121 | 122 | self.in_layers = torch.nn.ModuleList() 123 | self.res_skip_layers = torch.nn.ModuleList() 124 | self.drop = nn.Dropout(p_dropout) 125 | 126 | if gin_channels != 0: 127 | cond_layer = torch.nn.Conv1d(gin_channels, 2*hidden_channels*n_layers, 1) 128 | self.cond_layer = torch.nn.utils.weight_norm(cond_layer, name='weight') 129 | 130 | for i in range(n_layers): 131 | dilation = dilation_rate ** i 132 | padding = int((kernel_size * dilation - dilation) / 2) 133 | in_layer = torch.nn.Conv1d(hidden_channels, 2*hidden_channels, kernel_size, 134 | dilation=dilation, padding=padding) 135 | in_layer = torch.nn.utils.weight_norm(in_layer, name='weight') 136 | self.in_layers.append(in_layer) 137 | 138 | # last one is not necessary 139 | if i < n_layers - 1: 140 | res_skip_channels = 2 * hidden_channels 141 | else: 142 | res_skip_channels = hidden_channels 143 | 144 | res_skip_layer = torch.nn.Conv1d(hidden_channels, res_skip_channels, 1) 145 | res_skip_layer = torch.nn.utils.weight_norm(res_skip_layer, name='weight') 146 | self.res_skip_layers.append(res_skip_layer) 147 | 148 | def forward(self, x, x_mask, g=None, **kwargs): 149 | output = torch.zeros_like(x) 150 | n_channels_tensor = torch.IntTensor([self.hidden_channels]) 151 | 152 | if g is not None: 153 | g = self.cond_layer(g) 154 | 155 | for i in range(self.n_layers): 156 | x_in = self.in_layers[i](x) 157 | if g is not None: 158 | cond_offset = i * 2 * self.hidden_channels 159 | g_l = g[:,cond_offset:cond_offset+2*self.hidden_channels,:] 160 | else: 161 | g_l = torch.zeros_like(x_in) 162 | 163 | acts = commons.fused_add_tanh_sigmoid_multiply( 164 | x_in, 165 | g_l, 166 | n_channels_tensor) 167 | acts = self.drop(acts) 168 | 169 | res_skip_acts = self.res_skip_layers[i](acts) 170 | if i < self.n_layers - 1: 171 | res_acts = res_skip_acts[:,:self.hidden_channels,:] 172 | x = (x + res_acts) * x_mask 173 | output = output + res_skip_acts[:,self.hidden_channels:,:] 174 | else: 175 | output = output + res_skip_acts 176 | return output * x_mask 177 | 178 | def remove_weight_norm(self): 179 | if self.gin_channels != 0: 180 | torch.nn.utils.remove_weight_norm(self.cond_layer) 181 | for l in self.in_layers: 182 | torch.nn.utils.remove_weight_norm(l) 183 | for l in self.res_skip_layers: 184 | torch.nn.utils.remove_weight_norm(l) 185 | 186 | 187 | class ResBlock1(torch.nn.Module): 188 | def __init__(self, channels, kernel_size=3, dilation=(1, 3, 5)): 189 | super(ResBlock1, self).__init__() 190 | self.convs1 = nn.ModuleList([ 191 | weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0], 192 | padding=get_padding(kernel_size, dilation[0]))), 193 | weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1], 194 | padding=get_padding(kernel_size, dilation[1]))), 195 | weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[2], 196 | padding=get_padding(kernel_size, dilation[2]))) 197 | ]) 198 | self.convs1.apply(init_weights) 199 | 200 | self.convs2 = nn.ModuleList([ 201 | weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1, 202 | padding=get_padding(kernel_size, 1))), 203 | weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1, 204 | padding=get_padding(kernel_size, 1))), 205 | weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1, 206 | padding=get_padding(kernel_size, 1))) 207 | ]) 208 | self.convs2.apply(init_weights) 209 | 210 | def forward(self, x, x_mask=None): 211 | for c1, c2 in zip(self.convs1, self.convs2): 212 | xt = F.leaky_relu(x, LRELU_SLOPE) 213 | if x_mask is not None: 214 | xt = xt * x_mask 215 | xt = c1(xt) 216 | xt = F.leaky_relu(xt, LRELU_SLOPE) 217 | if x_mask is not None: 218 | xt = xt * x_mask 219 | xt = c2(xt) 220 | x = xt + x 221 | if x_mask is not None: 222 | x = x * x_mask 223 | return x 224 | 225 | def remove_weight_norm(self): 226 | for l in self.convs1: 227 | remove_weight_norm(l) 228 | for l in self.convs2: 229 | remove_weight_norm(l) 230 | 231 | 232 | class ResBlock2(torch.nn.Module): 233 | def __init__(self, channels, kernel_size=3, dilation=(1, 3)): 234 | super(ResBlock2, self).__init__() 235 | self.convs = nn.ModuleList([ 236 | weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0], 237 | padding=get_padding(kernel_size, dilation[0]))), 238 | weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1], 239 | padding=get_padding(kernel_size, dilation[1]))) 240 | ]) 241 | self.convs.apply(init_weights) 242 | 243 | def forward(self, x, x_mask=None): 244 | for c in self.convs: 245 | xt = F.leaky_relu(x, LRELU_SLOPE) 246 | if x_mask is not None: 247 | xt = xt * x_mask 248 | xt = c(xt) 249 | x = xt + x 250 | if x_mask is not None: 251 | x = x * x_mask 252 | return x 253 | 254 | def remove_weight_norm(self): 255 | for l in self.convs: 256 | remove_weight_norm(l) 257 | 258 | 259 | class Log(nn.Module): 260 | def forward(self, x, x_mask, reverse=False, **kwargs): 261 | if not reverse: 262 | y = torch.log(torch.clamp_min(x, 1e-5)) * x_mask 263 | logdet = torch.sum(-y, [1, 2]) 264 | return y, logdet 265 | else: 266 | x = torch.exp(x) * x_mask 267 | return x 268 | 269 | 270 | class Flip(nn.Module): 271 | def forward(self, x, *args, reverse=False, **kwargs): 272 | x = torch.flip(x, [1]) 273 | if not reverse: 274 | logdet = torch.zeros(x.size(0)).to(dtype=x.dtype, device=x.device) 275 | return x, logdet 276 | else: 277 | return x 278 | 279 | 280 | class ElementwiseAffine(nn.Module): 281 | def __init__(self, channels): 282 | super().__init__() 283 | self.channels = channels 284 | self.m = nn.Parameter(torch.zeros(channels,1)) 285 | self.logs = nn.Parameter(torch.zeros(channels,1)) 286 | 287 | def forward(self, x, x_mask, reverse=False, **kwargs): 288 | if not reverse: 289 | y = self.m + torch.exp(self.logs) * x 290 | y = y * x_mask 291 | logdet = torch.sum(self.logs * x_mask, [1,2]) 292 | return y, logdet 293 | else: 294 | x = (x - self.m) * torch.exp(-self.logs) * x_mask 295 | return x 296 | 297 | 298 | class ResidualCouplingLayer(nn.Module): 299 | def __init__(self, 300 | channels, 301 | hidden_channels, 302 | kernel_size, 303 | dilation_rate, 304 | n_layers, 305 | p_dropout=0, 306 | gin_channels=0, 307 | mean_only=False): 308 | assert channels % 2 == 0, "channels should be divisible by 2" 309 | super().__init__() 310 | self.channels = channels 311 | self.hidden_channels = hidden_channels 312 | self.kernel_size = kernel_size 313 | self.dilation_rate = dilation_rate 314 | self.n_layers = n_layers 315 | self.half_channels = channels // 2 316 | self.mean_only = mean_only 317 | 318 | self.pre = nn.Conv1d(self.half_channels, hidden_channels, 1) 319 | self.enc = WN(hidden_channels, kernel_size, dilation_rate, n_layers, p_dropout=p_dropout, gin_channels=gin_channels) 320 | self.post = nn.Conv1d(hidden_channels, self.half_channels * (2 - mean_only), 1) 321 | self.post.weight.data.zero_() 322 | self.post.bias.data.zero_() 323 | 324 | def forward(self, x, x_mask, g=None, reverse=False): 325 | x0, x1 = torch.split(x, [self.half_channels]*2, 1) 326 | h = self.pre(x0) * x_mask 327 | h = self.enc(h, x_mask, g=g) 328 | stats = self.post(h) * x_mask 329 | if not self.mean_only: 330 | m, logs = torch.split(stats, [self.half_channels]*2, 1) 331 | else: 332 | m = stats 333 | logs = torch.zeros_like(m) 334 | 335 | if not reverse: 336 | x1 = m + x1 * torch.exp(logs) * x_mask 337 | x = torch.cat([x0, x1], 1) 338 | logdet = torch.sum(logs, [1,2]) 339 | return x, logdet 340 | else: 341 | x1 = (x1 - m) * torch.exp(-logs) * x_mask 342 | x = torch.cat([x0, x1], 1) 343 | return x 344 | 345 | 346 | class ConvFlow(nn.Module): 347 | def __init__(self, in_channels, filter_channels, kernel_size, n_layers, num_bins=10, tail_bound=5.0): 348 | super().__init__() 349 | self.in_channels = in_channels 350 | self.filter_channels = filter_channels 351 | self.kernel_size = kernel_size 352 | self.n_layers = n_layers 353 | self.num_bins = num_bins 354 | self.tail_bound = tail_bound 355 | self.half_channels = in_channels // 2 356 | 357 | self.pre = nn.Conv1d(self.half_channels, filter_channels, 1) 358 | self.convs = DDSConv(filter_channels, kernel_size, n_layers, p_dropout=0.) 359 | self.proj = nn.Conv1d(filter_channels, self.half_channels * (num_bins * 3 - 1), 1) 360 | self.proj.weight.data.zero_() 361 | self.proj.bias.data.zero_() 362 | 363 | def forward(self, x, x_mask, g=None, reverse=False): 364 | x0, x1 = torch.split(x, [self.half_channels]*2, 1) 365 | h = self.pre(x0) 366 | h = self.convs(h, x_mask, g=g) 367 | h = self.proj(h) * x_mask 368 | 369 | b, c, t = x0.shape 370 | h = h.reshape(b, c, -1, t).permute(0, 1, 3, 2) # [b, cx?, t] -> [b, c, t, ?] 371 | 372 | unnormalized_widths = h[..., :self.num_bins] / math.sqrt(self.filter_channels) 373 | unnormalized_heights = h[..., self.num_bins:2*self.num_bins] / math.sqrt(self.filter_channels) 374 | unnormalized_derivatives = h[..., 2 * self.num_bins:] 375 | 376 | x1, logabsdet = piecewise_rational_quadratic_transform(x1, 377 | unnormalized_widths, 378 | unnormalized_heights, 379 | unnormalized_derivatives, 380 | inverse=reverse, 381 | tails='linear', 382 | tail_bound=self.tail_bound 383 | ) 384 | 385 | x = torch.cat([x0, x1], 1) * x_mask 386 | logdet = torch.sum(logabsdet * x_mask, [1,2]) 387 | if not reverse: 388 | return x, logdet 389 | else: 390 | return x 391 | -------------------------------------------------------------------------------- /monotonic_align/__init__.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | from .monotonic_align.core import maximum_path_c 4 | 5 | 6 | def maximum_path(neg_cent, mask): 7 | """ Cython optimized version. 8 | neg_cent: [b, t_t, t_s] 9 | mask: [b, t_t, t_s] 10 | """ 11 | device = neg_cent.device 12 | dtype = neg_cent.dtype 13 | neg_cent = neg_cent.data.cpu().numpy().astype(np.float32) 14 | path = np.zeros(neg_cent.shape, dtype=np.int32) 15 | 16 | t_t_max = mask.sum(1)[:, 0].data.cpu().numpy().astype(np.int32) 17 | t_s_max = mask.sum(2)[:, 0].data.cpu().numpy().astype(np.int32) 18 | maximum_path_c(path, neg_cent, t_t_max, t_s_max) 19 | return torch.from_numpy(path).to(device=device, dtype=dtype) 20 | -------------------------------------------------------------------------------- /monotonic_align/core.pyx: -------------------------------------------------------------------------------- 1 | cimport cython 2 | from cython.parallel import prange 3 | 4 | 5 | @cython.boundscheck(False) 6 | @cython.wraparound(False) 7 | cdef void maximum_path_each(int[:,::1] path, float[:,::1] value, int t_y, int t_x, float max_neg_val=-1e9) nogil: 8 | cdef int x 9 | cdef int y 10 | cdef float v_prev 11 | cdef float v_cur 12 | cdef float tmp 13 | cdef int index = t_x - 1 14 | 15 | for y in range(t_y): 16 | for x in range(max(0, t_x + y - t_y), min(t_x, y + 1)): 17 | if x == y: 18 | v_cur = max_neg_val 19 | else: 20 | v_cur = value[y-1, x] 21 | if x == 0: 22 | if y == 0: 23 | v_prev = 0. 24 | else: 25 | v_prev = max_neg_val 26 | else: 27 | v_prev = value[y-1, x-1] 28 | value[y, x] += max(v_prev, v_cur) 29 | 30 | for y in range(t_y - 1, -1, -1): 31 | path[y, index] = 1 32 | if index != 0 and (index == y or value[y-1, index] < value[y-1, index-1]): 33 | index = index - 1 34 | 35 | 36 | @cython.boundscheck(False) 37 | @cython.wraparound(False) 38 | cpdef void maximum_path_c(int[:,:,::1] paths, float[:,:,::1] values, int[::1] t_ys, int[::1] t_xs) nogil: 39 | cdef int b = paths.shape[0] 40 | cdef int i 41 | for i in prange(b, nogil=True): 42 | maximum_path_each(paths[i], values[i], t_ys[i], t_xs[i]) 43 | -------------------------------------------------------------------------------- /monotonic_align/setup.py: -------------------------------------------------------------------------------- 1 | from distutils.core import setup 2 | from Cython.Build import cythonize 3 | import numpy 4 | 5 | setup( 6 | name = 'monotonic_align', 7 | ext_modules = cythonize("core.pyx"), 8 | include_dirs=[numpy.get_include()] 9 | ) 10 | -------------------------------------------------------------------------------- /preprocess_v2.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | import json 4 | import sys 5 | sys.setrecursionlimit(500000) # Fix the error message of RecursionError: maximum recursion depth exceeded while calling a Python object. You can change the number as you want. 6 | 7 | if __name__ == "__main__": 8 | parser = argparse.ArgumentParser() 9 | parser.add_argument("--add_auxiliary_data", type=bool, help="Whether to add extra data as fine-tuning helper") 10 | parser.add_argument("--languages", default="CJE") 11 | args = parser.parse_args() 12 | if args.languages == "CJE": 13 | langs = ["[ZH]", "[JA]", "[EN]"] 14 | elif args.languages == "CJ": 15 | langs = ["[ZH]", "[JA]"] 16 | elif args.languages == "C": 17 | langs = ["[ZH]"] 18 | new_annos = [] 19 | # Source 1: transcribed short audios 20 | if os.path.exists("short_character_anno.txt"): 21 | with open("short_character_anno.txt", 'r', encoding='utf-8') as f: 22 | short_character_anno = f.readlines() 23 | new_annos += short_character_anno 24 | # Source 2: transcribed long audio segments 25 | if os.path.exists("./long_character_anno.txt"): 26 | with open("./long_character_anno.txt", 'r', encoding='utf-8') as f: 27 | long_character_anno = f.readlines() 28 | new_annos += long_character_anno 29 | 30 | # Get all speaker names 31 | speakers = [] 32 | for line in new_annos: 33 | path, speaker, text = line.split("|") 34 | if speaker not in speakers: 35 | speakers.append(speaker) 36 | assert (len(speakers) != 0), "No audio file found. Please check your uploaded file structure." 37 | # Source 3 (Optional): sampled audios as extra training helpers 38 | if args.add_auxiliary_data: 39 | with open("./sampled_audio4ft.txt", 'r', encoding='utf-8') as f: 40 | old_annos = f.readlines() 41 | # filter old_annos according to supported languages 42 | filtered_old_annos = [] 43 | for line in old_annos: 44 | for lang in langs: 45 | if lang in line: 46 | filtered_old_annos.append(line) 47 | old_annos = filtered_old_annos 48 | for line in old_annos: 49 | path, speaker, text = line.split("|") 50 | if speaker not in speakers: 51 | speakers.append(speaker) 52 | num_old_voices = len(old_annos) 53 | num_new_voices = len(new_annos) 54 | # STEP 1: balance number of new & old voices 55 | cc_duplicate = num_old_voices // num_new_voices 56 | if cc_duplicate == 0: 57 | cc_duplicate = 1 58 | 59 | 60 | # STEP 2: modify config file 61 | with open("./configs/finetune_speaker.json", 'r', encoding='utf-8') as f: 62 | hps = json.load(f) 63 | 64 | # assign ids to new speakers 65 | speaker2id = {} 66 | for i, speaker in enumerate(speakers): 67 | speaker2id[speaker] = i 68 | # modify n_speakers 69 | hps['data']["n_speakers"] = len(speakers) 70 | # overwrite speaker names 71 | hps['speakers'] = speaker2id 72 | hps['train']['log_interval'] = 10 73 | hps['train']['eval_interval'] = 100 74 | hps['train']['batch_size'] = 16 75 | hps['data']['training_files'] = "final_annotation_train.txt" 76 | hps['data']['validation_files'] = "final_annotation_val.txt" 77 | # save modified config 78 | with open("./configs/modified_finetune_speaker.json", 'w', encoding='utf-8') as f: 79 | json.dump(hps, f, indent=2) 80 | 81 | # STEP 3: clean annotations, replace speaker names with assigned speaker IDs 82 | import text 83 | cleaned_new_annos = [] 84 | for i, line in enumerate(new_annos): 85 | path, speaker, txt = line.split("|") 86 | if len(txt) > 150: 87 | continue 88 | cleaned_text = text._clean_text(txt, hps['data']['text_cleaners']) 89 | cleaned_text += "\n" if not cleaned_text.endswith("\n") else "" 90 | cleaned_new_annos.append(path + "|" + str(speaker2id[speaker]) + "|" + cleaned_text) 91 | cleaned_old_annos = [] 92 | for i, line in enumerate(old_annos): 93 | path, speaker, txt = line.split("|") 94 | if len(txt) > 150: 95 | continue 96 | cleaned_text = text._clean_text(txt, hps['data']['text_cleaners']) 97 | cleaned_text += "\n" if not cleaned_text.endswith("\n") else "" 98 | cleaned_old_annos.append(path + "|" + str(speaker2id[speaker]) + "|" + cleaned_text) 99 | # merge with old annotation 100 | final_annos = cleaned_old_annos + cc_duplicate * cleaned_new_annos 101 | # save annotation file 102 | with open("./final_annotation_train.txt", 'w', encoding='utf-8') as f: 103 | for line in final_annos: 104 | f.write(line) 105 | # save annotation file for validation 106 | with open("./final_annotation_val.txt", 'w', encoding='utf-8') as f: 107 | for line in cleaned_new_annos: 108 | f.write(line) 109 | print("finished") 110 | else: 111 | # Do not add extra helper data 112 | # STEP 1: modify config file 113 | with open("./configs/finetune_speaker.json", 'r', encoding='utf-8') as f: 114 | hps = json.load(f) 115 | 116 | # assign ids to new speakers 117 | speaker2id = {} 118 | for i, speaker in enumerate(speakers): 119 | speaker2id[speaker] = i 120 | # modify n_speakers 121 | hps['data']["n_speakers"] = len(speakers) 122 | # overwrite speaker names 123 | hps['speakers'] = speaker2id 124 | hps['train']['log_interval'] = 10 125 | hps['train']['eval_interval'] = 100 126 | hps['train']['batch_size'] = 16 127 | hps['data']['training_files'] = "final_annotation_train.txt" 128 | hps['data']['validation_files'] = "final_annotation_val.txt" 129 | # save modified config 130 | with open("./configs/modified_finetune_speaker.json", 'w', encoding='utf-8') as f: 131 | json.dump(hps, f, indent=2) 132 | 133 | # STEP 2: clean annotations, replace speaker names with assigned speaker IDs 134 | import text 135 | 136 | cleaned_new_annos = [] 137 | for i, line in enumerate(new_annos): 138 | path, speaker, txt = line.split("|") 139 | if len(txt) > 150: 140 | continue 141 | cleaned_text = text._clean_text(txt, hps['data']['text_cleaners']).replace("[ZH]", "") 142 | cleaned_text += "\n" if not cleaned_text.endswith("\n") else "" 143 | cleaned_new_annos.append(path + "|" + str(speaker2id[speaker]) + "|" + cleaned_text) 144 | 145 | final_annos = cleaned_new_annos 146 | # save annotation file 147 | with open("./final_annotation_train.txt", 'w', encoding='utf-8') as f: 148 | for line in final_annos: 149 | f.write(line) 150 | # save annotation file for validation 151 | with open("./final_annotation_val.txt", 'w', encoding='utf-8') as f: 152 | for line in cleaned_new_annos: 153 | f.write(line) 154 | print("finished") 155 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | Cython==0.29.21 2 | librosa==0.9.2 3 | matplotlib==3.3.1 4 | scikit-learn==1.0.2 5 | scipy 6 | numpy==1.22 7 | tensorboard 8 | torch==2.1.2 9 | torchvision==0.16.2 10 | torchaudio==2.1.2 11 | unidecode 12 | pyopenjtalk-prebuilt 13 | jamo 14 | pypinyin 15 | jieba 16 | protobuf 17 | cn2an 18 | inflect 19 | eng_to_ipa 20 | ko_pron 21 | indic_transliteration==2.3.37 22 | num_thai==0.0.5 23 | opencc==1.1.1 24 | demucs 25 | git+https://github.com/openai/whisper.git 26 | gradio 27 | -------------------------------------------------------------------------------- /scripts/denoise_audio.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import torchaudio 4 | raw_audio_dir = "./raw_audio/" 5 | denoise_audio_dir = "./denoised_audio/" 6 | filelist = list(os.walk(raw_audio_dir))[0][2] 7 | # 2023/4/21: Get the target sampling rate 8 | with open("./configs/finetune_speaker.json", 'r', encoding='utf-8') as f: 9 | hps = json.load(f) 10 | target_sr = hps['data']['sampling_rate'] 11 | for file in filelist: 12 | if file.endswith(".wav"): 13 | os.system(f"demucs --two-stems=vocals {raw_audio_dir}{file}") 14 | for file in filelist: 15 | file = file.replace(".wav", "") 16 | wav, sr = torchaudio.load(f"./separated/htdemucs/{file}/vocals.wav", frame_offset=0, num_frames=-1, normalize=True, 17 | channels_first=True) 18 | # merge two channels into one 19 | wav = wav.mean(dim=0).unsqueeze(0) 20 | if sr != target_sr: 21 | wav = torchaudio.transforms.Resample(orig_freq=sr, new_freq=target_sr)(wav) 22 | torchaudio.save(denoise_audio_dir + file + ".wav", wav, target_sr, channels_first=True) -------------------------------------------------------------------------------- /scripts/download_model.py: -------------------------------------------------------------------------------- 1 | from google.colab import files 2 | files.download("./G_latest.pth") 3 | files.download("./finetune_speaker.json") 4 | files.download("./moegoe_config.json") -------------------------------------------------------------------------------- /scripts/download_video.py: -------------------------------------------------------------------------------- 1 | import os 2 | import random 3 | import shutil 4 | from concurrent.futures import ThreadPoolExecutor 5 | from google.colab import files 6 | import subprocess 7 | 8 | basepath = os.getcwd() 9 | uploaded = files.upload() # 上传文件 10 | for filename in uploaded.keys(): 11 | assert (filename.endswith(".txt")), "speaker-videolink info could only be .txt file!" 12 | shutil.move(os.path.join(basepath, filename), os.path.join("./speaker_links.txt")) 13 | 14 | 15 | def generate_infos(): 16 | infos = [] 17 | with open("./speaker_links.txt", 'r', encoding='utf-8') as f: 18 | lines = f.readlines() 19 | for line in lines: 20 | line = line.replace("\n", "").replace(" ", "") 21 | if line == "": 22 | continue 23 | speaker, link = line.split("|") 24 | filename = speaker + "_" + str(random.randint(0, 1000000)) 25 | infos.append({"link": link, "filename": filename}) 26 | return infos 27 | 28 | 29 | def download_video(info): 30 | 31 | link = info["link"] 32 | filename = info["filename"] 33 | print(f"Starting download for:\nFilename: {filename}\nLink: {link}") 34 | 35 | try: 36 | result = subprocess.run( 37 | ["yt-dlp", "-f", "30280", link, "-o", f"./video_data/{filename}.mp4", "--no-check-certificate"], 38 | stdout=subprocess.PIPE, 39 | stderr=subprocess.PIPE, 40 | text=True, 41 | check=True 42 | ) 43 | print(f"Download completed for {filename}:\n{result.stdout}") 44 | except subprocess.CalledProcessError as e: 45 | print(f"Failed to download {link}:\n{e.stderr}") 46 | 47 | 48 | if __name__ == "__main__": 49 | infos = generate_infos() 50 | with ThreadPoolExecutor(max_workers=os.cpu_count()) as executor: 51 | executor.map(download_video, infos) 52 | -------------------------------------------------------------------------------- /scripts/long_audio_transcribe.py: -------------------------------------------------------------------------------- 1 | from moviepy.editor import AudioFileClip 2 | import whisper 3 | import os 4 | import json 5 | import torchaudio 6 | import librosa 7 | import torch 8 | import argparse 9 | parent_dir = "./denoised_audio/" 10 | filelist = list(os.walk(parent_dir))[0][2] 11 | if __name__ == "__main__": 12 | parser = argparse.ArgumentParser() 13 | parser.add_argument("--languages", default="CJE") 14 | parser.add_argument("--whisper_size", default="medium") 15 | args = parser.parse_args() 16 | if args.languages == "CJE": 17 | lang2token = { 18 | 'zh': "[ZH]", 19 | 'ja': "[JA]", 20 | "en": "[EN]", 21 | } 22 | elif args.languages == "CJ": 23 | lang2token = { 24 | 'zh': "[ZH]", 25 | 'ja': "[JA]", 26 | } 27 | elif args.languages == "C": 28 | lang2token = { 29 | 'zh': "[ZH]", 30 | } 31 | assert(torch.cuda.is_available()), "Please enable GPU in order to run Whisper!" 32 | with open("./configs/finetune_speaker.json", 'r', encoding='utf-8') as f: 33 | hps = json.load(f) 34 | target_sr = hps['data']['sampling_rate'] 35 | model = whisper.load_model(args.whisper_size) 36 | speaker_annos = [] 37 | for file in filelist: 38 | audio_path = os.path.join(parent_dir, file) 39 | print(f"Transcribing {audio_path}...\n") 40 | 41 | options = dict(beam_size=5, best_of=5) 42 | transcribe_options = dict(task="transcribe", **options) 43 | 44 | result = model.transcribe(audio_path, word_timestamps=True, **transcribe_options) 45 | segments = result["segments"] 46 | lang = result['language'] 47 | if lang not in lang2token: 48 | print(f"{lang} not supported, ignoring...\n") 49 | continue 50 | 51 | character_name = file.rstrip(".wav").split("_")[0] 52 | code = file.rstrip(".wav").split("_")[1] 53 | outdir = os.path.join("./segmented_character_voice", character_name) 54 | os.makedirs(outdir, exist_ok=True) 55 | 56 | wav, sr = torchaudio.load( 57 | audio_path, 58 | frame_offset=0, 59 | num_frames=-1, 60 | normalize=True, 61 | channels_first=True 62 | ) 63 | 64 | for i, seg in enumerate(segments): 65 | start_time = seg['start'] 66 | end_time = seg['end'] 67 | text = seg['text'] 68 | text_tokened = lang2token[lang] + text.replace("\n", "") + lang2token[lang] + "\n" 69 | start_idx = int(start_time * sr) 70 | end_idx = int(end_time * sr) 71 | num_samples = end_idx - start_idx 72 | if num_samples <= 0: 73 | print(f"Skipping zero-length segment: start={start_time}, end={end_time}") 74 | continue 75 | wav_seg = wav[:, start_idx:end_idx] 76 | if wav_seg.shape[1] == 0: 77 | print(f"Skipping empty segment i={i}, shape={wav_seg.shape}") 78 | continue 79 | if sr != target_sr: 80 | resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=target_sr) 81 | wav_seg = resampler(wav_seg) 82 | 83 | wav_seg_name = f"{character_name}_{code}_{i}.wav" 84 | savepth = os.path.join(outdir, wav_seg_name) 85 | speaker_annos.append(savepth + "|" + character_name + "|" + text_tokened) 86 | print(f"Transcribed segment: {speaker_annos[-1]}") 87 | torchaudio.save(savepth, wav_seg, target_sr, channels_first=True) 88 | 89 | if len(speaker_annos) == 0: 90 | print("Warning: no long audios & videos found, this IS expected if you have only uploaded short audios") 91 | print("this IS NOT expected if you have uploaded any long audios, videos or video links. Please check your file structure or make sure your audio/video language is supported.") 92 | with open("./long_character_anno.txt", 'w', encoding='utf-8') as f: 93 | for line in speaker_annos: 94 | f.write(line) 95 | -------------------------------------------------------------------------------- /scripts/rearrange_speaker.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import argparse 3 | import json 4 | 5 | if __name__ == "__main__": 6 | parser = argparse.ArgumentParser() 7 | parser.add_argument("--model_dir", type=str, default="./OUTPUT_MODEL/G_latest.pth") 8 | parser.add_argument("--config_dir", type=str, default="./configs/modified_finetune_speaker.json") 9 | args = parser.parse_args() 10 | 11 | model_sd = torch.load(args.model_dir, map_location='cpu') 12 | with open(args.config_dir, 'r', encoding='utf-8') as f: 13 | hps = json.load(f) 14 | 15 | valid_speakers = list(hps['speakers'].keys()) 16 | if hps['data']['n_speakers'] > len(valid_speakers): 17 | new_emb_g = torch.zeros([len(valid_speakers), 256]) 18 | old_emb_g = model_sd['model']['emb_g.weight'] 19 | for i, speaker in enumerate(valid_speakers): 20 | new_emb_g[i, :] = old_emb_g[hps['speakers'][speaker], :] 21 | hps['speakers'][speaker] = i 22 | hps['data']['n_speakers'] = len(valid_speakers) 23 | model_sd['model']['emb_g.weight'] = new_emb_g 24 | with open("./finetune_speaker.json", 'w', encoding='utf-8') as f: 25 | json.dump(hps, f, indent=2) 26 | torch.save(model_sd, "./G_latest.pth") 27 | else: 28 | with open("./finetune_speaker.json", 'w', encoding='utf-8') as f: 29 | json.dump(hps, f, indent=2) 30 | torch.save(model_sd, "./G_latest.pth") 31 | # save another config file copy in MoeGoe format 32 | hps['speakers'] = valid_speakers 33 | with open("./moegoe_config.json", 'w', encoding='utf-8') as f: 34 | json.dump(hps, f, indent=2) 35 | 36 | 37 | 38 | -------------------------------------------------------------------------------- /scripts/resample.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import argparse 4 | import torchaudio 5 | 6 | 7 | def main(): 8 | with open("./configs/finetune_speaker.json", 'r', encoding='utf-8') as f: 9 | hps = json.load(f) 10 | target_sr = hps['data']['sampling_rate'] 11 | filelist = list(os.walk("./sampled_audio4ft"))[0][2] 12 | if target_sr != 22050: 13 | for wavfile in filelist: 14 | wav, sr = torchaudio.load("./sampled_audio4ft" + "/" + wavfile, frame_offset=0, num_frames=-1, 15 | normalize=True, channels_first=True) 16 | wav = torchaudio.transforms.Resample(orig_freq=sr, new_freq=target_sr)(wav) 17 | torchaudio.save("./sampled_audio4ft" + "/" + wavfile, wav, target_sr, channels_first=True) 18 | 19 | if __name__ == "__main__": 20 | main() -------------------------------------------------------------------------------- /scripts/short_audio_transcribe.py: -------------------------------------------------------------------------------- 1 | import whisper 2 | import os 3 | import json 4 | import torchaudio 5 | import argparse 6 | import torch 7 | 8 | lang2token = { 9 | 'zh': "[ZH]", 10 | 'ja': "[JA]", 11 | "en": "[EN]", 12 | } 13 | def transcribe_one(audio_path): 14 | try: 15 | # load audio and pad/trim it to fit 30 seconds 16 | audio = whisper.load_audio(audio_path) 17 | audio = whisper.pad_or_trim(audio) 18 | 19 | # make log-Mel spectrogram and move to the same device as the model 20 | mel = whisper.log_mel_spectrogram(audio).to(model.device) 21 | 22 | # detect the spoken language 23 | _, probs = model.detect_language(mel) 24 | print(f"Detected language: {max(probs, key=probs.get)}") 25 | lang = max(probs, key=probs.get) 26 | # decode the audio 27 | options = whisper.DecodingOptions(beam_size=5) 28 | result = whisper.decode(model, mel, options) 29 | 30 | # print the recognized text 31 | print(result.text) 32 | return lang, result.text 33 | except Exception as e: 34 | print(e) 35 | if __name__ == "__main__": 36 | parser = argparse.ArgumentParser() 37 | parser.add_argument("--languages", default="CJE") 38 | parser.add_argument("--whisper_size", default="medium") 39 | args = parser.parse_args() 40 | if args.languages == "CJE": 41 | lang2token = { 42 | 'zh': "[ZH]", 43 | 'ja': "[JA]", 44 | "en": "[EN]", 45 | } 46 | elif args.languages == "CJ": 47 | lang2token = { 48 | 'zh': "[ZH]", 49 | 'ja': "[JA]", 50 | } 51 | elif args.languages == "C": 52 | lang2token = { 53 | 'zh': "[ZH]", 54 | } 55 | assert (torch.cuda.is_available()), "Please enable GPU in order to run Whisper!" 56 | model = whisper.load_model(args.whisper_size) 57 | parent_dir = "./custom_character_voice/" 58 | speaker_names = list(os.walk(parent_dir))[0][1] 59 | speaker_annos = [] 60 | total_files = sum([len(files) for r, d, files in os.walk(parent_dir)]) 61 | # resample audios 62 | # 2023/4/21: Get the target sampling rate 63 | with open("./configs/finetune_speaker.json", 'r', encoding='utf-8') as f: 64 | hps = json.load(f) 65 | target_sr = hps['data']['sampling_rate'] 66 | processed_files = 0 67 | for speaker in speaker_names: 68 | for i, wavfile in enumerate(list(os.walk(parent_dir + speaker))[0][2]): 69 | # try to load file as audio 70 | if wavfile.startswith("processed_"): 71 | continue 72 | try: 73 | wav, sr = torchaudio.load(parent_dir + speaker + "/" + wavfile, frame_offset=0, num_frames=-1, normalize=True, 74 | channels_first=True) 75 | wav = wav.mean(dim=0).unsqueeze(0) 76 | if sr != target_sr: 77 | wav = torchaudio.transforms.Resample(orig_freq=sr, new_freq=target_sr)(wav) 78 | if wav.shape[1] / sr > 20: 79 | print(f"{wavfile} too long, ignoring\n") 80 | save_path = parent_dir + speaker + "/" + f"processed_{i}.wav" 81 | torchaudio.save(save_path, wav, target_sr, channels_first=True) 82 | # transcribe text 83 | lang, text = transcribe_one(save_path) 84 | if lang not in list(lang2token.keys()): 85 | print(f"{lang} not supported, ignoring\n") 86 | continue 87 | text = lang2token[lang] + text + lang2token[lang] + "\n" 88 | speaker_annos.append(save_path + "|" + speaker + "|" + text) 89 | 90 | processed_files += 1 91 | print(f"Processed: {processed_files}/{total_files}") 92 | except: 93 | continue 94 | 95 | # # clean annotation 96 | # import argparse 97 | # import text 98 | # from utils import load_filepaths_and_text 99 | # for i, line in enumerate(speaker_annos): 100 | # path, sid, txt = line.split("|") 101 | # cleaned_text = text._clean_text(txt, ["cjke_cleaners2"]) 102 | # cleaned_text += "\n" if not cleaned_text.endswith("\n") else "" 103 | # speaker_annos[i] = path + "|" + sid + "|" + cleaned_text 104 | # write into annotation 105 | if len(speaker_annos) == 0: 106 | print("Warning: no short audios found, this IS expected if you have only uploaded long audios, videos or video links.") 107 | print("this IS NOT expected if you have uploaded a zip file of short audios. Please check your file structure or make sure your audio language is supported.") 108 | with open("short_character_anno.txt", 'w', encoding='utf-8') as f: 109 | for line in speaker_annos: 110 | f.write(line) 111 | 112 | # import json 113 | # # generate new config 114 | # with open("./configs/finetune_speaker.json", 'r', encoding='utf-8') as f: 115 | # hps = json.load(f) 116 | # # modify n_speakers 117 | # hps['data']["n_speakers"] = 1000 + len(speaker2id) 118 | # # add speaker names 119 | # for speaker in speaker_names: 120 | # hps['speakers'][speaker] = speaker2id[speaker] 121 | # # save modified config 122 | # with open("./configs/modified_finetune_speaker.json", 'w', encoding='utf-8') as f: 123 | # json.dump(hps, f, indent=2) 124 | # print("finished") 125 | -------------------------------------------------------------------------------- /scripts/video2audio.py: -------------------------------------------------------------------------------- 1 | import os 2 | from concurrent.futures import ThreadPoolExecutor 3 | 4 | from moviepy.editor import AudioFileClip 5 | 6 | video_dir = "./video_data/" 7 | audio_dir = "./raw_audio/" 8 | filelist = list(os.walk(video_dir))[0][2] 9 | 10 | 11 | def generate_infos(): 12 | videos = [] 13 | for file in filelist: 14 | if file.endswith(".mp4"): 15 | videos.append(file) 16 | return videos 17 | 18 | 19 | def clip_file(file): 20 | my_audio_clip = AudioFileClip(video_dir + file) 21 | my_audio_clip.write_audiofile(audio_dir + file.rstrip("mp4") + "wav") 22 | 23 | 24 | if __name__ == "__main__": 25 | infos = generate_infos() 26 | with ThreadPoolExecutor(max_workers=os.cpu_count()) as executor: 27 | executor.map(clip_file, infos) 28 | -------------------------------------------------------------------------------- /scripts/voice_upload.py: -------------------------------------------------------------------------------- 1 | from google.colab import files 2 | import shutil 3 | import os 4 | import argparse 5 | if __name__ == "__main__": 6 | parser = argparse.ArgumentParser() 7 | parser.add_argument("--type", type=str, required=True, help="type of file to upload") 8 | args = parser.parse_args() 9 | file_type = args.type 10 | 11 | basepath = os.getcwd() 12 | uploaded = files.upload() # 上传文件 13 | assert(file_type in ['zip', 'audio', 'video']) 14 | if file_type == "zip": 15 | upload_path = "./custom_character_voice/" 16 | for filename in uploaded.keys(): 17 | #将上传的文件移动到指定的位置上 18 | shutil.move(os.path.join(basepath, filename), os.path.join(upload_path, "custom_character_voice.zip")) 19 | elif file_type == "audio": 20 | upload_path = "./raw_audio/" 21 | for filename in uploaded.keys(): 22 | #将上传的文件移动到指定的位置上 23 | shutil.move(os.path.join(basepath, filename), os.path.join(upload_path, filename)) 24 | elif file_type == "video": 25 | upload_path = "./video_data/" 26 | for filename in uploaded.keys(): 27 | # 将上传的文件移动到指定的位置上 28 | shutil.move(os.path.join(basepath, filename), os.path.join(upload_path, filename)) -------------------------------------------------------------------------------- /text/LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2017 Keith Ito 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy 4 | of this software and associated documentation files (the "Software"), to deal 5 | in the Software without restriction, including without limitation the rights 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | copies of the Software, and to permit persons to whom the Software is 8 | furnished to do so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in 11 | all copies or substantial portions of the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 19 | THE SOFTWARE. 20 | -------------------------------------------------------------------------------- /text/__init__.py: -------------------------------------------------------------------------------- 1 | """ from https://github.com/keithito/tacotron """ 2 | from text import cleaners 3 | from text.symbols import symbols 4 | 5 | 6 | # Mappings from symbol to numeric ID and vice versa: 7 | _symbol_to_id = {s: i for i, s in enumerate(symbols)} 8 | _id_to_symbol = {i: s for i, s in enumerate(symbols)} 9 | 10 | 11 | def text_to_sequence(text, symbols, cleaner_names): 12 | '''Converts a string of text to a sequence of IDs corresponding to the symbols in the text. 13 | Args: 14 | text: string to convert to a sequence 15 | cleaner_names: names of the cleaner functions to run the text through 16 | Returns: 17 | List of integers corresponding to the symbols in the text 18 | ''' 19 | sequence = [] 20 | symbol_to_id = {s: i for i, s in enumerate(symbols)} 21 | clean_text = _clean_text(text, cleaner_names) 22 | print(clean_text) 23 | print(f" length:{len(clean_text)}") 24 | for symbol in clean_text: 25 | if symbol not in symbol_to_id.keys(): 26 | continue 27 | symbol_id = symbol_to_id[symbol] 28 | sequence += [symbol_id] 29 | print(f" length:{len(sequence)}") 30 | return sequence 31 | 32 | 33 | def cleaned_text_to_sequence(cleaned_text, symbols): 34 | '''Converts a string of text to a sequence of IDs corresponding to the symbols in the text. 35 | Args: 36 | text: string to convert to a sequence 37 | Returns: 38 | List of integers corresponding to the symbols in the text 39 | ''' 40 | symbol_to_id = {s: i for i, s in enumerate(symbols)} 41 | sequence = [symbol_to_id[symbol] for symbol in cleaned_text if symbol in symbol_to_id.keys()] 42 | return sequence 43 | 44 | 45 | def sequence_to_text(sequence): 46 | '''Converts a sequence of IDs back to a string''' 47 | result = '' 48 | for symbol_id in sequence: 49 | s = _id_to_symbol[symbol_id] 50 | result += s 51 | return result 52 | 53 | 54 | def _clean_text(text, cleaner_names): 55 | for name in cleaner_names: 56 | cleaner = getattr(cleaners, name) 57 | if not cleaner: 58 | raise Exception('Unknown cleaner: %s' % name) 59 | text = cleaner(text) 60 | return text 61 | -------------------------------------------------------------------------------- /text/__pycache__/__init__.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Plachtaa/VITS-fast-fine-tuning/8d341c7215f7770e81e6ac4486602179883d09af/text/__pycache__/__init__.cpython-37.pyc -------------------------------------------------------------------------------- /text/__pycache__/cleaners.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Plachtaa/VITS-fast-fine-tuning/8d341c7215f7770e81e6ac4486602179883d09af/text/__pycache__/cleaners.cpython-37.pyc -------------------------------------------------------------------------------- /text/__pycache__/english.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Plachtaa/VITS-fast-fine-tuning/8d341c7215f7770e81e6ac4486602179883d09af/text/__pycache__/english.cpython-37.pyc -------------------------------------------------------------------------------- /text/__pycache__/japanese.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Plachtaa/VITS-fast-fine-tuning/8d341c7215f7770e81e6ac4486602179883d09af/text/__pycache__/japanese.cpython-37.pyc -------------------------------------------------------------------------------- /text/__pycache__/korean.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Plachtaa/VITS-fast-fine-tuning/8d341c7215f7770e81e6ac4486602179883d09af/text/__pycache__/korean.cpython-37.pyc -------------------------------------------------------------------------------- /text/__pycache__/mandarin.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Plachtaa/VITS-fast-fine-tuning/8d341c7215f7770e81e6ac4486602179883d09af/text/__pycache__/mandarin.cpython-37.pyc -------------------------------------------------------------------------------- /text/__pycache__/sanskrit.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Plachtaa/VITS-fast-fine-tuning/8d341c7215f7770e81e6ac4486602179883d09af/text/__pycache__/sanskrit.cpython-37.pyc -------------------------------------------------------------------------------- /text/__pycache__/symbols.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Plachtaa/VITS-fast-fine-tuning/8d341c7215f7770e81e6ac4486602179883d09af/text/__pycache__/symbols.cpython-37.pyc -------------------------------------------------------------------------------- /text/__pycache__/thai.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Plachtaa/VITS-fast-fine-tuning/8d341c7215f7770e81e6ac4486602179883d09af/text/__pycache__/thai.cpython-37.pyc -------------------------------------------------------------------------------- /text/cantonese.py: -------------------------------------------------------------------------------- 1 | import re 2 | import cn2an 3 | import opencc 4 | 5 | 6 | converter = opencc.OpenCC('jyutjyu') 7 | 8 | # List of (Latin alphabet, ipa) pairs: 9 | _latin_to_ipa = [(re.compile('%s' % x[0]), x[1]) for x in [ 10 | ('A', 'ei˥'), 11 | ('B', 'biː˥'), 12 | ('C', 'siː˥'), 13 | ('D', 'tiː˥'), 14 | ('E', 'iː˥'), 15 | ('F', 'e˥fuː˨˩'), 16 | ('G', 'tsiː˥'), 17 | ('H', 'ɪk̚˥tsʰyː˨˩'), 18 | ('I', 'ɐi˥'), 19 | ('J', 'tsei˥'), 20 | ('K', 'kʰei˥'), 21 | ('L', 'e˥llou˨˩'), 22 | ('M', 'ɛːm˥'), 23 | ('N', 'ɛːn˥'), 24 | ('O', 'ou˥'), 25 | ('P', 'pʰiː˥'), 26 | ('Q', 'kʰiːu˥'), 27 | ('R', 'aː˥lou˨˩'), 28 | ('S', 'ɛː˥siː˨˩'), 29 | ('T', 'tʰiː˥'), 30 | ('U', 'juː˥'), 31 | ('V', 'wiː˥'), 32 | ('W', 'tʊk̚˥piː˥juː˥'), 33 | ('X', 'ɪk̚˥siː˨˩'), 34 | ('Y', 'waːi˥'), 35 | ('Z', 'iː˨sɛːt̚˥') 36 | ]] 37 | 38 | 39 | def number_to_cantonese(text): 40 | return re.sub(r'\d+(?:\.?\d+)?', lambda x: cn2an.an2cn(x.group()), text) 41 | 42 | 43 | def latin_to_ipa(text): 44 | for regex, replacement in _latin_to_ipa: 45 | text = re.sub(regex, replacement, text) 46 | return text 47 | 48 | 49 | def cantonese_to_ipa(text): 50 | text = number_to_cantonese(text.upper()) 51 | text = converter.convert(text).replace('-','').replace('$',' ') 52 | text = re.sub(r'[A-Z]', lambda x: latin_to_ipa(x.group())+' ', text) 53 | text = re.sub(r'[、；：]', '，', text) 54 | text = re.sub(r'\s*，\s*', ', ', text) 55 | text = re.sub(r'\s*。\s*', '. ', text) 56 | text = re.sub(r'\s*？\s*', '? ', text) 57 | text = re.sub(r'\s*！\s*', '! ', text) 58 | text = re.sub(r'\s*$', '', text) 59 | return text 60 | -------------------------------------------------------------------------------- /text/cleaners.py: -------------------------------------------------------------------------------- 1 | import re 2 | from text.japanese import japanese_to_romaji_with_accent, japanese_to_ipa, japanese_to_ipa2, japanese_to_ipa3 3 | from text.korean import latin_to_hangul, number_to_hangul, divide_hangul, korean_to_lazy_ipa, korean_to_ipa 4 | from text.mandarin import number_to_chinese, chinese_to_bopomofo, latin_to_bopomofo, chinese_to_romaji, chinese_to_lazy_ipa, chinese_to_ipa, chinese_to_ipa2 5 | from text.sanskrit import devanagari_to_ipa 6 | from text.english import english_to_lazy_ipa, english_to_ipa2, english_to_lazy_ipa2 7 | from text.thai import num_to_thai, latin_to_thai 8 | # from text.shanghainese import shanghainese_to_ipa 9 | # from text.cantonese import cantonese_to_ipa 10 | # from text.ngu_dialect import ngu_dialect_to_ipa 11 | 12 | 13 | def japanese_cleaners(text): 14 | text = japanese_to_romaji_with_accent(text) 15 | text = re.sub(r'([A-Za-z])$', r'\1.', text) 16 | return text 17 | 18 | 19 | def japanese_cleaners2(text): 20 | return japanese_cleaners(text).replace('ts', 'ʦ').replace('...', '…') 21 | 22 | 23 | def korean_cleaners(text): 24 | '''Pipeline for Korean text''' 25 | text = latin_to_hangul(text) 26 | text = number_to_hangul(text) 27 | text = divide_hangul(text) 28 | text = re.sub(r'([\u3131-\u3163])$', r'\1.', text) 29 | return text 30 | 31 | 32 | def chinese_cleaners(text): 33 | '''Pipeline for Chinese text''' 34 | text = text.replace("[ZH]", "") 35 | text = number_to_chinese(text) 36 | text = chinese_to_bopomofo(text) 37 | text = latin_to_bopomofo(text) 38 | text = re.sub(r'([ˉˊˇˋ˙])$', r'\1。', text) 39 | return text 40 | 41 | 42 | def zh_ja_mixture_cleaners(text): 43 | text = re.sub(r'\[ZH\](.*?)\[ZH\]', 44 | lambda x: chinese_to_romaji(x.group(1))+' ', text) 45 | text = re.sub(r'\[JA\](.*?)\[JA\]', lambda x: japanese_to_romaji_with_accent( 46 | x.group(1)).replace('ts', 'ʦ').replace('u', 'ɯ').replace('...', '…')+' ', text) 47 | text = re.sub(r'\s+$', '', text) 48 | text = re.sub(r'([^\.,!\?\-…~])$', r'\1.', text) 49 | return text 50 | 51 | 52 | def sanskrit_cleaners(text): 53 | text = text.replace('॥', '।').replace('ॐ', 'ओम्') 54 | text = re.sub(r'([^।])$', r'\1।', text) 55 | return text 56 | 57 | 58 | def cjks_cleaners(text): 59 | text = re.sub(r'\[ZH\](.*?)\[ZH\]', 60 | lambda x: chinese_to_lazy_ipa(x.group(1))+' ', text) 61 | text = re.sub(r'\[JA\](.*?)\[JA\]', 62 | lambda x: japanese_to_ipa(x.group(1))+' ', text) 63 | text = re.sub(r'\[KO\](.*?)\[KO\]', 64 | lambda x: korean_to_lazy_ipa(x.group(1))+' ', text) 65 | text = re.sub(r'\[SA\](.*?)\[SA\]', 66 | lambda x: devanagari_to_ipa(x.group(1))+' ', text) 67 | text = re.sub(r'\[EN\](.*?)\[EN\]', 68 | lambda x: english_to_lazy_ipa(x.group(1))+' ', text) 69 | text = re.sub(r'\s+$', '', text) 70 | text = re.sub(r'([^\.,!\?\-…~])$', r'\1.', text) 71 | return text 72 | 73 | 74 | def cjke_cleaners(text): 75 | text = re.sub(r'\[ZH\](.*?)\[ZH\]', lambda x: chinese_to_lazy_ipa(x.group(1)).replace( 76 | 'ʧ', 'tʃ').replace('ʦ', 'ts').replace('ɥan', 'ɥæn')+' ', text) 77 | text = re.sub(r'\[JA\](.*?)\[JA\]', lambda x: japanese_to_ipa(x.group(1)).replace('ʧ', 'tʃ').replace( 78 | 'ʦ', 'ts').replace('ɥan', 'ɥæn').replace('ʥ', 'dz')+' ', text) 79 | text = re.sub(r'\[KO\](.*?)\[KO\]', 80 | lambda x: korean_to_ipa(x.group(1))+' ', text) 81 | text = re.sub(r'\[EN\](.*?)\[EN\]', lambda x: english_to_ipa2(x.group(1)).replace('ɑ', 'a').replace( 82 | 'ɔ', 'o').replace('ɛ', 'e').replace('ɪ', 'i').replace('ʊ', 'u')+' ', text) 83 | text = re.sub(r'\s+$', '', text) 84 | text = re.sub(r'([^\.,!\?\-…~])$', r'\1.', text) 85 | return text 86 | 87 | 88 | def cjke_cleaners2(text): 89 | text = re.sub(r'\[ZH\](.*?)\[ZH\]', 90 | lambda x: chinese_to_ipa(x.group(1))+' ', text) 91 | text = re.sub(r'\[JA\](.*?)\[JA\]', 92 | lambda x: japanese_to_ipa2(x.group(1))+' ', text) 93 | text = re.sub(r'\[KO\](.*?)\[KO\]', 94 | lambda x: korean_to_ipa(x.group(1))+' ', text) 95 | text = re.sub(r'\[EN\](.*?)\[EN\]', 96 | lambda x: english_to_ipa2(x.group(1))+' ', text) 97 | text = re.sub(r'\s+$', '', text) 98 | text = re.sub(r'([^\.,!\?\-…~])$', r'\1.', text) 99 | return text 100 | 101 | 102 | def thai_cleaners(text): 103 | text = num_to_thai(text) 104 | text = latin_to_thai(text) 105 | return text 106 | 107 | 108 | # def shanghainese_cleaners(text): 109 | # text = shanghainese_to_ipa(text) 110 | # text = re.sub(r'([^\.,!\?\-…~])$', r'\1.', text) 111 | # return text 112 | 113 | 114 | # def chinese_dialect_cleaners(text): 115 | # text = re.sub(r'\[ZH\](.*?)\[ZH\]', 116 | # lambda x: chinese_to_ipa2(x.group(1))+' ', text) 117 | # text = re.sub(r'\[JA\](.*?)\[JA\]', 118 | # lambda x: japanese_to_ipa3(x.group(1)).replace('Q', 'ʔ')+' ', text) 119 | # text = re.sub(r'\[SH\](.*?)\[SH\]', lambda x: shanghainese_to_ipa(x.group(1)).replace('1', '˥˧').replace('5', 120 | # '˧˧˦').replace('6', '˩˩˧').replace('7', '˥').replace('8', '˩˨').replace('ᴀ', 'ɐ').replace('ᴇ', 'e')+' ', text) 121 | # text = re.sub(r'\[GD\](.*?)\[GD\]', 122 | # lambda x: cantonese_to_ipa(x.group(1))+' ', text) 123 | # text = re.sub(r'\[EN\](.*?)\[EN\]', 124 | # lambda x: english_to_lazy_ipa2(x.group(1))+' ', text) 125 | # text = re.sub(r'\[([A-Z]{2})\](.*?)\[\1\]', lambda x: ngu_dialect_to_ipa(x.group(2), x.group( 126 | # 1)).replace('ʣ', 'dz').replace('ʥ', 'dʑ').replace('ʦ', 'ts').replace('ʨ', 'tɕ')+' ', text) 127 | # text = re.sub(r'\s+$', '', text) 128 | # text = re.sub(r'([^\.,!\?\-…~])$', r'\1.', text) 129 | # return text 130 | -------------------------------------------------------------------------------- /text/english.py: -------------------------------------------------------------------------------- 1 | """ from https://github.com/keithito/tacotron """ 2 | 3 | ''' 4 | Cleaners are transformations that run over the input text at both training and eval time. 5 | 6 | Cleaners can be selected by passing a comma-delimited list of cleaner names as the "cleaners" 7 | hyperparameter. Some cleaners are English-specific. You'll typically want to use: 8 | 1. "english_cleaners" for English text 9 | 2. "transliteration_cleaners" for non-English text that can be transliterated to ASCII using 10 | the Unidecode library (https://pypi.python.org/pypi/Unidecode) 11 | 3. "basic_cleaners" if you do not want to transliterate (in this case, you should also update 12 | the symbols in symbols.py to match your data). 13 | ''' 14 | 15 | 16 | # Regular expression matching whitespace: 17 | 18 | 19 | import re 20 | import inflect 21 | from unidecode import unidecode 22 | import eng_to_ipa as ipa 23 | _inflect = inflect.engine() 24 | _comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])') 25 | _decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)') 26 | _pounds_re = re.compile(r'£([0-9\,]*[0-9]+)') 27 | _dollars_re = re.compile(r'\$([0-9\.\,]*[0-9]+)') 28 | _ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)') 29 | _number_re = re.compile(r'[0-9]+') 30 | 31 | # List of (regular expression, replacement) pairs for abbreviations: 32 | _abbreviations = [(re.compile('\\b%s\\.' % x[0], re.IGNORECASE), x[1]) for x in [ 33 | ('mrs', 'misess'), 34 | ('mr', 'mister'), 35 | ('dr', 'doctor'), 36 | ('st', 'saint'), 37 | ('co', 'company'), 38 | ('jr', 'junior'), 39 | ('maj', 'major'), 40 | ('gen', 'general'), 41 | ('drs', 'doctors'), 42 | ('rev', 'reverend'), 43 | ('lt', 'lieutenant'), 44 | ('hon', 'honorable'), 45 | ('sgt', 'sergeant'), 46 | ('capt', 'captain'), 47 | ('esq', 'esquire'), 48 | ('ltd', 'limited'), 49 | ('col', 'colonel'), 50 | ('ft', 'fort'), 51 | ]] 52 | 53 | 54 | # List of (ipa, lazy ipa) pairs: 55 | _lazy_ipa = [(re.compile('%s' % x[0]), x[1]) for x in [ 56 | ('r', 'ɹ'), 57 | ('æ', 'e'), 58 | ('ɑ', 'a'), 59 | ('ɔ', 'o'), 60 | ('ð', 'z'), 61 | ('θ', 's'), 62 | ('ɛ', 'e'), 63 | ('ɪ', 'i'), 64 | ('ʊ', 'u'), 65 | ('ʒ', 'ʥ'), 66 | ('ʤ', 'ʥ'), 67 | ('ˈ', '↓'), 68 | ]] 69 | 70 | # List of (ipa, lazy ipa2) pairs: 71 | _lazy_ipa2 = [(re.compile('%s' % x[0]), x[1]) for x in [ 72 | ('r', 'ɹ'), 73 | ('ð', 'z'), 74 | ('θ', 's'), 75 | ('ʒ', 'ʑ'), 76 | ('ʤ', 'dʑ'), 77 | ('ˈ', '↓'), 78 | ]] 79 | 80 | # List of (ipa, ipa2) pairs 81 | _ipa_to_ipa2 = [(re.compile('%s' % x[0]), x[1]) for x in [ 82 | ('r', 'ɹ'), 83 | ('ʤ', 'dʒ'), 84 | ('ʧ', 'tʃ') 85 | ]] 86 | 87 | 88 | def expand_abbreviations(text): 89 | for regex, replacement in _abbreviations: 90 | text = re.sub(regex, replacement, text) 91 | return text 92 | 93 | 94 | def collapse_whitespace(text): 95 | return re.sub(r'\s+', ' ', text) 96 | 97 | 98 | def _remove_commas(m): 99 | return m.group(1).replace(',', '') 100 | 101 | 102 | def _expand_decimal_point(m): 103 | return m.group(1).replace('.', ' point ') 104 | 105 | 106 | def _expand_dollars(m): 107 | match = m.group(1) 108 | parts = match.split('.') 109 | if len(parts) > 2: 110 | return match + ' dollars' # Unexpected format 111 | dollars = int(parts[0]) if parts[0] else 0 112 | cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0 113 | if dollars and cents: 114 | dollar_unit = 'dollar' if dollars == 1 else 'dollars' 115 | cent_unit = 'cent' if cents == 1 else 'cents' 116 | return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit) 117 | elif dollars: 118 | dollar_unit = 'dollar' if dollars == 1 else 'dollars' 119 | return '%s %s' % (dollars, dollar_unit) 120 | elif cents: 121 | cent_unit = 'cent' if cents == 1 else 'cents' 122 | return '%s %s' % (cents, cent_unit) 123 | else: 124 | return 'zero dollars' 125 | 126 | 127 | def _expand_ordinal(m): 128 | return _inflect.number_to_words(m.group(0)) 129 | 130 | 131 | def _expand_number(m): 132 | num = int(m.group(0)) 133 | if num > 1000 and num < 3000: 134 | if num == 2000: 135 | return 'two thousand' 136 | elif num > 2000 and num < 2010: 137 | return 'two thousand ' + _inflect.number_to_words(num % 100) 138 | elif num % 100 == 0: 139 | return _inflect.number_to_words(num // 100) + ' hundred' 140 | else: 141 | return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ') 142 | else: 143 | return _inflect.number_to_words(num, andword='') 144 | 145 | 146 | def normalize_numbers(text): 147 | text = re.sub(_comma_number_re, _remove_commas, text) 148 | text = re.sub(_pounds_re, r'\1 pounds', text) 149 | text = re.sub(_dollars_re, _expand_dollars, text) 150 | text = re.sub(_decimal_number_re, _expand_decimal_point, text) 151 | text = re.sub(_ordinal_re, _expand_ordinal, text) 152 | text = re.sub(_number_re, _expand_number, text) 153 | return text 154 | 155 | 156 | def mark_dark_l(text): 157 | return re.sub(r'l([^aeiouæɑɔəɛɪʊ ]*(?: |$))', lambda x: 'ɫ'+x.group(1), text) 158 | 159 | 160 | def english_to_ipa(text): 161 | text = unidecode(text).lower() 162 | text = expand_abbreviations(text) 163 | text = normalize_numbers(text) 164 | phonemes = ipa.convert(text) 165 | phonemes = collapse_whitespace(phonemes) 166 | return phonemes 167 | 168 | 169 | def english_to_lazy_ipa(text): 170 | text = english_to_ipa(text) 171 | for regex, replacement in _lazy_ipa: 172 | text = re.sub(regex, replacement, text) 173 | return text 174 | 175 | 176 | def english_to_ipa2(text): 177 | text = english_to_ipa(text) 178 | text = mark_dark_l(text) 179 | for regex, replacement in _ipa_to_ipa2: 180 | text = re.sub(regex, replacement, text) 181 | return text.replace('...', '…') 182 | 183 | 184 | def english_to_lazy_ipa2(text): 185 | text = english_to_ipa(text) 186 | for regex, replacement in _lazy_ipa2: 187 | text = re.sub(regex, replacement, text) 188 | return text 189 | -------------------------------------------------------------------------------- /text/japanese.py: -------------------------------------------------------------------------------- 1 | import re 2 | from unidecode import unidecode 3 | import pyopenjtalk 4 | 5 | 6 | # Regular expression matching Japanese without punctuation marks: 7 | _japanese_characters = re.compile( 8 | r'[A-Za-z\d\u3005\u3040-\u30ff\u4e00-\u9fff\uff11-\uff19\uff21-\uff3a\uff41-\uff5a\uff66-\uff9d]') 9 | 10 | # Regular expression matching non-Japanese characters or punctuation marks: 11 | _japanese_marks = re.compile( 12 | r'[^A-Za-z\d\u3005\u3040-\u30ff\u4e00-\u9fff\uff11-\uff19\uff21-\uff3a\uff41-\uff5a\uff66-\uff9d]') 13 | 14 | # List of (symbol, Japanese) pairs for marks: 15 | _symbols_to_japanese = [(re.compile('%s' % x[0]), x[1]) for x in [ 16 | ('％', 'パーセント') 17 | ]] 18 | 19 | # List of (romaji, ipa) pairs for marks: 20 | _romaji_to_ipa = [(re.compile('%s' % x[0]), x[1]) for x in [ 21 | ('ts', 'ʦ'), 22 | ('u', 'ɯ'), 23 | ('j', 'ʥ'), 24 | ('y', 'j'), 25 | ('ni', 'n^i'), 26 | ('nj', 'n^'), 27 | ('hi', 'çi'), 28 | ('hj', 'ç'), 29 | ('f', 'ɸ'), 30 | ('I', 'i*'), 31 | ('U', 'ɯ*'), 32 | ('r', 'ɾ') 33 | ]] 34 | 35 | # List of (romaji, ipa2) pairs for marks: 36 | _romaji_to_ipa2 = [(re.compile('%s' % x[0]), x[1]) for x in [ 37 | ('u', 'ɯ'), 38 | ('ʧ', 'tʃ'), 39 | ('j', 'dʑ'), 40 | ('y', 'j'), 41 | ('ni', 'n^i'), 42 | ('nj', 'n^'), 43 | ('hi', 'çi'), 44 | ('hj', 'ç'), 45 | ('f', 'ɸ'), 46 | ('I', 'i*'), 47 | ('U', 'ɯ*'), 48 | ('r', 'ɾ') 49 | ]] 50 | 51 | # List of (consonant, sokuon) pairs: 52 | _real_sokuon = [(re.compile('%s' % x[0]), x[1]) for x in [ 53 | (r'Q([↑↓]*[kg])', r'k#\1'), 54 | (r'Q([↑↓]*[tdjʧ])', r't#\1'), 55 | (r'Q([↑↓]*[sʃ])', r's\1'), 56 | (r'Q([↑↓]*[pb])', r'p#\1') 57 | ]] 58 | 59 | # List of (consonant, hatsuon) pairs: 60 | _real_hatsuon = [(re.compile('%s' % x[0]), x[1]) for x in [ 61 | (r'N([↑↓]*[pbm])', r'm\1'), 62 | (r'N([↑↓]*[ʧʥj])', r'n^\1'), 63 | (r'N([↑↓]*[tdn])', r'n\1'), 64 | (r'N([↑↓]*[kg])', r'ŋ\1') 65 | ]] 66 | 67 | 68 | def symbols_to_japanese(text): 69 | for regex, replacement in _symbols_to_japanese: 70 | text = re.sub(regex, replacement, text) 71 | return text 72 | 73 | 74 | def japanese_to_romaji_with_accent(text): 75 | '''Reference https://r9y9.github.io/ttslearn/latest/notebooks/ch10_Recipe-Tacotron.html''' 76 | text = symbols_to_japanese(text) 77 | sentences = re.split(_japanese_marks, text) 78 | marks = re.findall(_japanese_marks, text) 79 | text = '' 80 | for i, sentence in enumerate(sentences): 81 | if re.match(_japanese_characters, sentence): 82 | if text != '': 83 | text += ' ' 84 | labels = pyopenjtalk.extract_fullcontext(sentence) 85 | for n, label in enumerate(labels): 86 | phoneme = re.search(r'\-([^\+]*)\+', label).group(1) 87 | if phoneme not in ['sil', 'pau']: 88 | text += phoneme.replace('ch', 'ʧ').replace('sh', 89 | 'ʃ').replace('cl', 'Q') 90 | else: 91 | continue 92 | # n_moras = int(re.search(r'/F:(\d+)_', label).group(1)) 93 | a1 = int(re.search(r"/A:(\-?[0-9]+)\+", label).group(1)) 94 | a2 = int(re.search(r"\+(\d+)\+", label).group(1)) 95 | a3 = int(re.search(r"\+(\d+)/", label).group(1)) 96 | if re.search(r'\-([^\+]*)\+', labels[n + 1]).group(1) in ['sil', 'pau']: 97 | a2_next = -1 98 | else: 99 | a2_next = int( 100 | re.search(r"\+(\d+)\+", labels[n + 1]).group(1)) 101 | # Accent phrase boundary 102 | if a3 == 1 and a2_next == 1: 103 | text += ' ' 104 | # Falling 105 | elif a1 == 0 and a2_next == a2 + 1: 106 | text += '↓' 107 | # Rising 108 | elif a2 == 1 and a2_next == 2: 109 | text += '↑' 110 | if i < len(marks): 111 | text += unidecode(marks[i]).replace(' ', '') 112 | return text 113 | 114 | 115 | def get_real_sokuon(text): 116 | for regex, replacement in _real_sokuon: 117 | text = re.sub(regex, replacement, text) 118 | return text 119 | 120 | 121 | def get_real_hatsuon(text): 122 | for regex, replacement in _real_hatsuon: 123 | text = re.sub(regex, replacement, text) 124 | return text 125 | 126 | 127 | def japanese_to_ipa(text): 128 | text = japanese_to_romaji_with_accent(text).replace('...', '…') 129 | text = re.sub( 130 | r'([aiueo])\1+', lambda x: x.group(0)[0]+'ː'*(len(x.group(0))-1), text) 131 | text = get_real_sokuon(text) 132 | text = get_real_hatsuon(text) 133 | for regex, replacement in _romaji_to_ipa: 134 | text = re.sub(regex, replacement, text) 135 | return text 136 | 137 | 138 | def japanese_to_ipa2(text): 139 | text = japanese_to_romaji_with_accent(text).replace('...', '…') 140 | text = get_real_sokuon(text) 141 | text = get_real_hatsuon(text) 142 | for regex, replacement in _romaji_to_ipa2: 143 | text = re.sub(regex, replacement, text) 144 | return text 145 | 146 | 147 | def japanese_to_ipa3(text): 148 | text = japanese_to_ipa2(text).replace('n^', 'ȵ').replace( 149 | 'ʃ', 'ɕ').replace('*', '\u0325').replace('#', '\u031a') 150 | text = re.sub( 151 | r'([aiɯeo])\1+', lambda x: x.group(0)[0]+'ː'*(len(x.group(0))-1), text) 152 | text = re.sub(r'((?:^|\s)(?:ts|tɕ|[kpt]))', r'\1ʰ', text) 153 | return text 154 | -------------------------------------------------------------------------------- /text/korean.py: -------------------------------------------------------------------------------- 1 | import re 2 | from jamo import h2j, j2hcj 3 | import ko_pron 4 | 5 | 6 | # This is a list of Korean classifiers preceded by pure Korean numerals. 7 | _korean_classifiers = '군데 권 개 그루 닢 대 두 마리 모 모금 뭇 발 발짝 방 번 벌 보루 살 수 술 시 쌈 움큼 정 짝 채 척 첩 축 켤레 톨 통' 8 | 9 | # List of (hangul, hangul divided) pairs: 10 | _hangul_divided = [(re.compile('%s' % x[0]), x[1]) for x in [ 11 | ('ㄳ', 'ㄱㅅ'), 12 | ('ㄵ', 'ㄴㅈ'), 13 | ('ㄶ', 'ㄴㅎ'), 14 | ('ㄺ', 'ㄹㄱ'), 15 | ('ㄻ', 'ㄹㅁ'), 16 | ('ㄼ', 'ㄹㅂ'), 17 | ('ㄽ', 'ㄹㅅ'), 18 | ('ㄾ', 'ㄹㅌ'), 19 | ('ㄿ', 'ㄹㅍ'), 20 | ('ㅀ', 'ㄹㅎ'), 21 | ('ㅄ', 'ㅂㅅ'), 22 | ('ㅘ', 'ㅗㅏ'), 23 | ('ㅙ', 'ㅗㅐ'), 24 | ('ㅚ', 'ㅗㅣ'), 25 | ('ㅝ', 'ㅜㅓ'), 26 | ('ㅞ', 'ㅜㅔ'), 27 | ('ㅟ', 'ㅜㅣ'), 28 | ('ㅢ', 'ㅡㅣ'), 29 | ('ㅑ', 'ㅣㅏ'), 30 | ('ㅒ', 'ㅣㅐ'), 31 | ('ㅕ', 'ㅣㅓ'), 32 | ('ㅖ', 'ㅣㅔ'), 33 | ('ㅛ', 'ㅣㅗ'), 34 | ('ㅠ', 'ㅣㅜ') 35 | ]] 36 | 37 | # List of (Latin alphabet, hangul) pairs: 38 | _latin_to_hangul = [(re.compile('%s' % x[0], re.IGNORECASE), x[1]) for x in [ 39 | ('a', '에이'), 40 | ('b', '비'), 41 | ('c', '시'), 42 | ('d', '디'), 43 | ('e', '이'), 44 | ('f', '에프'), 45 | ('g', '지'), 46 | ('h', '에이치'), 47 | ('i', '아이'), 48 | ('j', '제이'), 49 | ('k', '케이'), 50 | ('l', '엘'), 51 | ('m', '엠'), 52 | ('n', '엔'), 53 | ('o', '오'), 54 | ('p', '피'), 55 | ('q', '큐'), 56 | ('r', '아르'), 57 | ('s', '에스'), 58 | ('t', '티'), 59 | ('u', '유'), 60 | ('v', '브이'), 61 | ('w', '더블유'), 62 | ('x', '엑스'), 63 | ('y', '와이'), 64 | ('z', '제트') 65 | ]] 66 | 67 | # List of (ipa, lazy ipa) pairs: 68 | _ipa_to_lazy_ipa = [(re.compile('%s' % x[0], re.IGNORECASE), x[1]) for x in [ 69 | ('t͡ɕ','ʧ'), 70 | ('d͡ʑ','ʥ'), 71 | ('ɲ','n^'), 72 | ('ɕ','ʃ'), 73 | ('ʷ','w'), 74 | ('ɭ','l`'), 75 | ('ʎ','ɾ'), 76 | ('ɣ','ŋ'), 77 | ('ɰ','ɯ'), 78 | ('ʝ','j'), 79 | ('ʌ','ə'), 80 | ('ɡ','g'), 81 | ('\u031a','#'), 82 | ('\u0348','='), 83 | ('\u031e',''), 84 | ('\u0320',''), 85 | ('\u0339','') 86 | ]] 87 | 88 | 89 | def latin_to_hangul(text): 90 | for regex, replacement in _latin_to_hangul: 91 | text = re.sub(regex, replacement, text) 92 | return text 93 | 94 | 95 | def divide_hangul(text): 96 | text = j2hcj(h2j(text)) 97 | for regex, replacement in _hangul_divided: 98 | text = re.sub(regex, replacement, text) 99 | return text 100 | 101 | 102 | def hangul_number(num, sino=True): 103 | '''Reference https://github.com/Kyubyong/g2pK''' 104 | num = re.sub(',', '', num) 105 | 106 | if num == '0': 107 | return '영' 108 | if not sino and num == '20': 109 | return '스무' 110 | 111 | digits = '123456789' 112 | names = '일이삼사오육칠팔구' 113 | digit2name = {d: n for d, n in zip(digits, names)} 114 | 115 | modifiers = '한 두 세 네 다섯 여섯 일곱 여덟 아홉' 116 | decimals = '열 스물 서른 마흔 쉰 예순 일흔 여든 아흔' 117 | digit2mod = {d: mod for d, mod in zip(digits, modifiers.split())} 118 | digit2dec = {d: dec for d, dec in zip(digits, decimals.split())} 119 | 120 | spelledout = [] 121 | for i, digit in enumerate(num): 122 | i = len(num) - i - 1 123 | if sino: 124 | if i == 0: 125 | name = digit2name.get(digit, '') 126 | elif i == 1: 127 | name = digit2name.get(digit, '') + '십' 128 | name = name.replace('일십', '십') 129 | else: 130 | if i == 0: 131 | name = digit2mod.get(digit, '') 132 | elif i == 1: 133 | name = digit2dec.get(digit, '') 134 | if digit == '0': 135 | if i % 4 == 0: 136 | last_three = spelledout[-min(3, len(spelledout)):] 137 | if ''.join(last_three) == '': 138 | spelledout.append('') 139 | continue 140 | else: 141 | spelledout.append('') 142 | continue 143 | if i == 2: 144 | name = digit2name.get(digit, '') + '백' 145 | name = name.replace('일백', '백') 146 | elif i == 3: 147 | name = digit2name.get(digit, '') + '천' 148 | name = name.replace('일천', '천') 149 | elif i == 4: 150 | name = digit2name.get(digit, '') + '만' 151 | name = name.replace('일만', '만') 152 | elif i == 5: 153 | name = digit2name.get(digit, '') + '십' 154 | name = name.replace('일십', '십') 155 | elif i == 6: 156 | name = digit2name.get(digit, '') + '백' 157 | name = name.replace('일백', '백') 158 | elif i == 7: 159 | name = digit2name.get(digit, '') + '천' 160 | name = name.replace('일천', '천') 161 | elif i == 8: 162 | name = digit2name.get(digit, '') + '억' 163 | elif i == 9: 164 | name = digit2name.get(digit, '') + '십' 165 | elif i == 10: 166 | name = digit2name.get(digit, '') + '백' 167 | elif i == 11: 168 | name = digit2name.get(digit, '') + '천' 169 | elif i == 12: 170 | name = digit2name.get(digit, '') + '조' 171 | elif i == 13: 172 | name = digit2name.get(digit, '') + '십' 173 | elif i == 14: 174 | name = digit2name.get(digit, '') + '백' 175 | elif i == 15: 176 | name = digit2name.get(digit, '') + '천' 177 | spelledout.append(name) 178 | return ''.join(elem for elem in spelledout) 179 | 180 | 181 | def number_to_hangul(text): 182 | '''Reference https://github.com/Kyubyong/g2pK''' 183 | tokens = set(re.findall(r'(\d[\d,]*)([\uac00-\ud71f]+)', text)) 184 | for token in tokens: 185 | num, classifier = token 186 | if classifier[:2] in _korean_classifiers or classifier[0] in _korean_classifiers: 187 | spelledout = hangul_number(num, sino=False) 188 | else: 189 | spelledout = hangul_number(num, sino=True) 190 | text = text.replace(f'{num}{classifier}', f'{spelledout}{classifier}') 191 | # digit by digit for remaining digits 192 | digits = '0123456789' 193 | names = '영일이삼사오육칠팔구' 194 | for d, n in zip(digits, names): 195 | text = text.replace(d, n) 196 | return text 197 | 198 | 199 | def korean_to_lazy_ipa(text): 200 | text = latin_to_hangul(text) 201 | text = number_to_hangul(text) 202 | text=re.sub('[\uac00-\ud7af]+',lambda x:ko_pron.romanise(x.group(0),'ipa').split('] ~ [')[0],text) 203 | for regex, replacement in _ipa_to_lazy_ipa: 204 | text = re.sub(regex, replacement, text) 205 | return text 206 | 207 | 208 | def korean_to_ipa(text): 209 | text = korean_to_lazy_ipa(text) 210 | return text.replace('ʧ','tʃ').replace('ʥ','dʑ') 211 | -------------------------------------------------------------------------------- /text/mandarin.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import re 4 | from pypinyin import lazy_pinyin, BOPOMOFO 5 | import jieba 6 | import cn2an 7 | import logging 8 | 9 | 10 | # List of (Latin alphabet, bopomofo) pairs: 11 | _latin_to_bopomofo = [(re.compile('%s' % x[0], re.IGNORECASE), x[1]) for x in [ 12 | ('a', 'ㄟˉ'), 13 | ('b', 'ㄅㄧˋ'), 14 | ('c', 'ㄙㄧˉ'), 15 | ('d', 'ㄉㄧˋ'), 16 | ('e', 'ㄧˋ'), 17 | ('f', 'ㄝˊㄈㄨˋ'), 18 | ('g', 'ㄐㄧˋ'), 19 | ('h', 'ㄝˇㄑㄩˋ'), 20 | ('i', 'ㄞˋ'), 21 | ('j', 'ㄐㄟˋ'), 22 | ('k', 'ㄎㄟˋ'), 23 | ('l', 'ㄝˊㄛˋ'), 24 | ('m', 'ㄝˊㄇㄨˋ'), 25 | ('n', 'ㄣˉ'), 26 | ('o', 'ㄡˉ'), 27 | ('p', 'ㄆㄧˉ'), 28 | ('q', 'ㄎㄧㄡˉ'), 29 | ('r', 'ㄚˋ'), 30 | ('s', 'ㄝˊㄙˋ'), 31 | ('t', 'ㄊㄧˋ'), 32 | ('u', 'ㄧㄡˉ'), 33 | ('v', 'ㄨㄧˉ'), 34 | ('w', 'ㄉㄚˋㄅㄨˋㄌㄧㄡˋ'), 35 | ('x', 'ㄝˉㄎㄨˋㄙˋ'), 36 | ('y', 'ㄨㄞˋ'), 37 | ('z', 'ㄗㄟˋ') 38 | ]] 39 | 40 | # List of (bopomofo, romaji) pairs: 41 | _bopomofo_to_romaji = [(re.compile('%s' % x[0]), x[1]) for x in [ 42 | ('ㄅㄛ', 'p⁼wo'), 43 | ('ㄆㄛ', 'pʰwo'), 44 | ('ㄇㄛ', 'mwo'), 45 | ('ㄈㄛ', 'fwo'), 46 | ('ㄅ', 'p⁼'), 47 | ('ㄆ', 'pʰ'), 48 | ('ㄇ', 'm'), 49 | ('ㄈ', 'f'), 50 | ('ㄉ', 't⁼'), 51 | ('ㄊ', 'tʰ'), 52 | ('ㄋ', 'n'), 53 | ('ㄌ', 'l'), 54 | ('ㄍ', 'k⁼'), 55 | ('ㄎ', 'kʰ'), 56 | ('ㄏ', 'h'), 57 | ('ㄐ', 'ʧ⁼'), 58 | ('ㄑ', 'ʧʰ'), 59 | ('ㄒ', 'ʃ'), 60 | ('ㄓ', 'ʦ`⁼'), 61 | ('ㄔ', 'ʦ`ʰ'), 62 | ('ㄕ', 's`'), 63 | ('ㄖ', 'ɹ`'), 64 | ('ㄗ', 'ʦ⁼'), 65 | ('ㄘ', 'ʦʰ'), 66 | ('ㄙ', 's'), 67 | ('ㄚ', 'a'), 68 | ('ㄛ', 'o'), 69 | ('ㄜ', 'ə'), 70 | ('ㄝ', 'e'), 71 | ('ㄞ', 'ai'), 72 | ('ㄟ', 'ei'), 73 | ('ㄠ', 'au'), 74 | ('ㄡ', 'ou'), 75 | ('ㄧㄢ', 'yeNN'), 76 | ('ㄢ', 'aNN'), 77 | ('ㄧㄣ', 'iNN'), 78 | ('ㄣ', 'əNN'), 79 | ('ㄤ', 'aNg'), 80 | ('ㄧㄥ', 'iNg'), 81 | ('ㄨㄥ', 'uNg'), 82 | ('ㄩㄥ', 'yuNg'), 83 | ('ㄥ', 'əNg'), 84 | ('ㄦ', 'əɻ'), 85 | ('ㄧ', 'i'), 86 | ('ㄨ', 'u'), 87 | ('ㄩ', 'ɥ'), 88 | ('ˉ', '→'), 89 | ('ˊ', '↑'), 90 | ('ˇ', '↓↑'), 91 | ('ˋ', '↓'), 92 | ('˙', ''), 93 | ('，', ','), 94 | ('。', '.'), 95 | ('！', '!'), 96 | ('？', '?'), 97 | ('—', '-') 98 | ]] 99 | 100 | # List of (romaji, ipa) pairs: 101 | _romaji_to_ipa = [(re.compile('%s' % x[0], re.IGNORECASE), x[1]) for x in [ 102 | ('ʃy', 'ʃ'), 103 | ('ʧʰy', 'ʧʰ'), 104 | ('ʧ⁼y', 'ʧ⁼'), 105 | ('NN', 'n'), 106 | ('Ng', 'ŋ'), 107 | ('y', 'j'), 108 | ('h', 'x') 109 | ]] 110 | 111 | # List of (bopomofo, ipa) pairs: 112 | _bopomofo_to_ipa = [(re.compile('%s' % x[0]), x[1]) for x in [ 113 | ('ㄅㄛ', 'p⁼wo'), 114 | ('ㄆㄛ', 'pʰwo'), 115 | ('ㄇㄛ', 'mwo'), 116 | ('ㄈㄛ', 'fwo'), 117 | ('ㄅ', 'p⁼'), 118 | ('ㄆ', 'pʰ'), 119 | ('ㄇ', 'm'), 120 | ('ㄈ', 'f'), 121 | ('ㄉ', 't⁼'), 122 | ('ㄊ', 'tʰ'), 123 | ('ㄋ', 'n'), 124 | ('ㄌ', 'l'), 125 | ('ㄍ', 'k⁼'), 126 | ('ㄎ', 'kʰ'), 127 | ('ㄏ', 'x'), 128 | ('ㄐ', 'tʃ⁼'), 129 | ('ㄑ', 'tʃʰ'), 130 | ('ㄒ', 'ʃ'), 131 | ('ㄓ', 'ts`⁼'), 132 | ('ㄔ', 'ts`ʰ'), 133 | ('ㄕ', 's`'), 134 | ('ㄖ', 'ɹ`'), 135 | ('ㄗ', 'ts⁼'), 136 | ('ㄘ', 'tsʰ'), 137 | ('ㄙ', 's'), 138 | ('ㄚ', 'a'), 139 | ('ㄛ', 'o'), 140 | ('ㄜ', 'ə'), 141 | ('ㄝ', 'ɛ'), 142 | ('ㄞ', 'aɪ'), 143 | ('ㄟ', 'eɪ'), 144 | ('ㄠ', 'ɑʊ'), 145 | ('ㄡ', 'oʊ'), 146 | ('ㄧㄢ', 'jɛn'), 147 | ('ㄩㄢ', 'ɥæn'), 148 | ('ㄢ', 'an'), 149 | ('ㄧㄣ', 'in'), 150 | ('ㄩㄣ', 'ɥn'), 151 | ('ㄣ', 'ən'), 152 | ('ㄤ', 'ɑŋ'), 153 | ('ㄧㄥ', 'iŋ'), 154 | ('ㄨㄥ', 'ʊŋ'), 155 | ('ㄩㄥ', 'jʊŋ'), 156 | ('ㄥ', 'əŋ'), 157 | ('ㄦ', 'əɻ'), 158 | ('ㄧ', 'i'), 159 | ('ㄨ', 'u'), 160 | ('ㄩ', 'ɥ'), 161 | ('ˉ', '→'), 162 | ('ˊ', '↑'), 163 | ('ˇ', '↓↑'), 164 | ('ˋ', '↓'), 165 | ('˙', ''), 166 | ('，', ','), 167 | ('。', '.'), 168 | ('！', '!'), 169 | ('？', '?'), 170 | ('—', '-') 171 | ]] 172 | 173 | # List of (bopomofo, ipa2) pairs: 174 | _bopomofo_to_ipa2 = [(re.compile('%s' % x[0]), x[1]) for x in [ 175 | ('ㄅㄛ', 'pwo'), 176 | ('ㄆㄛ', 'pʰwo'), 177 | ('ㄇㄛ', 'mwo'), 178 | ('ㄈㄛ', 'fwo'), 179 | ('ㄅ', 'p'), 180 | ('ㄆ', 'pʰ'), 181 | ('ㄇ', 'm'), 182 | ('ㄈ', 'f'), 183 | ('ㄉ', 't'), 184 | ('ㄊ', 'tʰ'), 185 | ('ㄋ', 'n'), 186 | ('ㄌ', 'l'), 187 | ('ㄍ', 'k'), 188 | ('ㄎ', 'kʰ'), 189 | ('ㄏ', 'h'), 190 | ('ㄐ', 'tɕ'), 191 | ('ㄑ', 'tɕʰ'), 192 | ('ㄒ', 'ɕ'), 193 | ('ㄓ', 'tʂ'), 194 | ('ㄔ', 'tʂʰ'), 195 | ('ㄕ', 'ʂ'), 196 | ('ㄖ', 'ɻ'), 197 | ('ㄗ', 'ts'), 198 | ('ㄘ', 'tsʰ'), 199 | ('ㄙ', 's'), 200 | ('ㄚ', 'a'), 201 | ('ㄛ', 'o'), 202 | ('ㄜ', 'ɤ'), 203 | ('ㄝ', 'ɛ'), 204 | ('ㄞ', 'aɪ'), 205 | ('ㄟ', 'eɪ'), 206 | ('ㄠ', 'ɑʊ'), 207 | ('ㄡ', 'oʊ'), 208 | ('ㄧㄢ', 'jɛn'), 209 | ('ㄩㄢ', 'yæn'), 210 | ('ㄢ', 'an'), 211 | ('ㄧㄣ', 'in'), 212 | ('ㄩㄣ', 'yn'), 213 | ('ㄣ', 'ən'), 214 | ('ㄤ', 'ɑŋ'), 215 | ('ㄧㄥ', 'iŋ'), 216 | ('ㄨㄥ', 'ʊŋ'), 217 | ('ㄩㄥ', 'jʊŋ'), 218 | ('ㄥ', 'ɤŋ'), 219 | ('ㄦ', 'əɻ'), 220 | ('ㄧ', 'i'), 221 | ('ㄨ', 'u'), 222 | ('ㄩ', 'y'), 223 | ('ˉ', '˥'), 224 | ('ˊ', '˧˥'), 225 | ('ˇ', '˨˩˦'), 226 | ('ˋ', '˥˩'), 227 | ('˙', ''), 228 | ('，', ','), 229 | ('。', '.'), 230 | ('！', '!'), 231 | ('？', '?'), 232 | ('—', '-') 233 | ]] 234 | 235 | 236 | def number_to_chinese(text): 237 | numbers = re.findall(r'\d+(?:\.?\d+)?', text) 238 | for number in numbers: 239 | text = text.replace(number, cn2an.an2cn(number), 1) 240 | return text 241 | 242 | 243 | def chinese_to_bopomofo(text): 244 | text = text.replace('、', '，').replace('；', '，').replace('：', '，') 245 | words = jieba.lcut(text, cut_all=False) 246 | text = '' 247 | for word in words: 248 | bopomofos = lazy_pinyin(word, BOPOMOFO) 249 | if not re.search('[\u4e00-\u9fff]', word): 250 | text += word 251 | continue 252 | for i in range(len(bopomofos)): 253 | bopomofos[i] = re.sub(r'([\u3105-\u3129])$', r'\1ˉ', bopomofos[i]) 254 | if text != '': 255 | text += ' ' 256 | text += ''.join(bopomofos) 257 | return text 258 | 259 | 260 | def latin_to_bopomofo(text): 261 | for regex, replacement in _latin_to_bopomofo: 262 | text = re.sub(regex, replacement, text) 263 | return text 264 | 265 | 266 | def bopomofo_to_romaji(text): 267 | for regex, replacement in _bopomofo_to_romaji: 268 | text = re.sub(regex, replacement, text) 269 | return text 270 | 271 | 272 | def bopomofo_to_ipa(text): 273 | for regex, replacement in _bopomofo_to_ipa: 274 | text = re.sub(regex, replacement, text) 275 | return text 276 | 277 | 278 | def bopomofo_to_ipa2(text): 279 | for regex, replacement in _bopomofo_to_ipa2: 280 | text = re.sub(regex, replacement, text) 281 | return text 282 | 283 | 284 | def chinese_to_romaji(text): 285 | text = number_to_chinese(text) 286 | text = chinese_to_bopomofo(text) 287 | text = latin_to_bopomofo(text) 288 | text = bopomofo_to_romaji(text) 289 | text = re.sub('i([aoe])', r'y\1', text) 290 | text = re.sub('u([aoəe])', r'w\1', text) 291 | text = re.sub('([ʦsɹ]`[⁼ʰ]?)([→↓↑ ]+|$)', 292 | r'\1ɹ`\2', text).replace('ɻ', 'ɹ`') 293 | text = re.sub('([ʦs][⁼ʰ]?)([→↓↑ ]+|$)', r'\1ɹ\2', text) 294 | return text 295 | 296 | 297 | def chinese_to_lazy_ipa(text): 298 | text = chinese_to_romaji(text) 299 | for regex, replacement in _romaji_to_ipa: 300 | text = re.sub(regex, replacement, text) 301 | return text 302 | 303 | 304 | def chinese_to_ipa(text): 305 | text = number_to_chinese(text) 306 | text = chinese_to_bopomofo(text) 307 | text = latin_to_bopomofo(text) 308 | text = bopomofo_to_ipa(text) 309 | text = re.sub('i([aoe])', r'j\1', text) 310 | text = re.sub('u([aoəe])', r'w\1', text) 311 | text = re.sub('([sɹ]`[⁼ʰ]?)([→↓↑ ]+|$)', 312 | r'\1ɹ`\2', text).replace('ɻ', 'ɹ`') 313 | text = re.sub('([s][⁼ʰ]?)([→↓↑ ]+|$)', r'\1ɹ\2', text) 314 | return text 315 | 316 | 317 | def chinese_to_ipa2(text): 318 | text = number_to_chinese(text) 319 | text = chinese_to_bopomofo(text) 320 | text = latin_to_bopomofo(text) 321 | text = bopomofo_to_ipa2(text) 322 | text = re.sub(r'i([aoe])', r'j\1', text) 323 | text = re.sub(r'u([aoəe])', r'w\1', text) 324 | text = re.sub(r'([ʂɹ]ʰ?)([˩˨˧˦˥ ]+|$)', r'\1ʅ\2', text) 325 | text = re.sub(r'(sʰ?)([˩˨˧˦˥ ]+|$)', r'\1ɿ\2', text) 326 | return text 327 | -------------------------------------------------------------------------------- /text/ngu_dialect.py: -------------------------------------------------------------------------------- 1 | import re 2 | import opencc 3 | 4 | 5 | dialects = {'SZ': 'suzhou', 'WX': 'wuxi', 'CZ': 'changzhou', 'HZ': 'hangzhou', 6 | 'SX': 'shaoxing', 'NB': 'ningbo', 'JJ': 'jingjiang', 'YX': 'yixing', 7 | 'JD': 'jiading', 'ZR': 'zhenru', 'PH': 'pinghu', 'TX': 'tongxiang', 8 | 'JS': 'jiashan', 'HN': 'xiashi', 'LP': 'linping', 'XS': 'xiaoshan', 9 | 'FY': 'fuyang', 'RA': 'ruao', 'CX': 'cixi', 'SM': 'sanmen', 10 | 'TT': 'tiantai', 'WZ': 'wenzhou', 'SC': 'suichang', 'YB': 'youbu'} 11 | 12 | converters = {} 13 | 14 | for dialect in dialects.values(): 15 | try: 16 | converters[dialect] = opencc.OpenCC(dialect) 17 | except: 18 | pass 19 | 20 | 21 | def ngu_dialect_to_ipa(text, dialect): 22 | dialect = dialects[dialect] 23 | text = converters[dialect].convert(text).replace('-','').replace('$',' ') 24 | text = re.sub(r'[、；：]', '，', text) 25 | text = re.sub(r'\s*，\s*', ', ', text) 26 | text = re.sub(r'\s*。\s*', '. ', text) 27 | text = re.sub(r'\s*？\s*', '? ', text) 28 | text = re.sub(r'\s*！\s*', '! ', text) 29 | text = re.sub(r'\s*$', '', text) 30 | return text 31 | -------------------------------------------------------------------------------- /text/sanskrit.py: -------------------------------------------------------------------------------- 1 | import re 2 | from indic_transliteration import sanscript 3 | 4 | 5 | # List of (iast, ipa) pairs: 6 | _iast_to_ipa = [(re.compile('%s' % x[0]), x[1]) for x in [ 7 | ('a', 'ə'), 8 | ('ā', 'aː'), 9 | ('ī', 'iː'), 10 | ('ū', 'uː'), 11 | ('ṛ', 'ɹ`'), 12 | ('ṝ', 'ɹ`ː'), 13 | ('ḷ', 'l`'), 14 | ('ḹ', 'l`ː'), 15 | ('e', 'eː'), 16 | ('o', 'oː'), 17 | ('k', 'k⁼'), 18 | ('k⁼h', 'kʰ'), 19 | ('g', 'g⁼'), 20 | ('g⁼h', 'gʰ'), 21 | ('ṅ', 'ŋ'), 22 | ('c', 'ʧ⁼'), 23 | ('ʧ⁼h', 'ʧʰ'), 24 | ('j', 'ʥ⁼'), 25 | ('ʥ⁼h', 'ʥʰ'), 26 | ('ñ', 'n^'), 27 | ('ṭ', 't`⁼'), 28 | ('t`⁼h', 't`ʰ'), 29 | ('ḍ', 'd`⁼'), 30 | ('d`⁼h', 'd`ʰ'), 31 | ('ṇ', 'n`'), 32 | ('t', 't⁼'), 33 | ('t⁼h', 'tʰ'), 34 | ('d', 'd⁼'), 35 | ('d⁼h', 'dʰ'), 36 | ('p', 'p⁼'), 37 | ('p⁼h', 'pʰ'), 38 | ('b', 'b⁼'), 39 | ('b⁼h', 'bʰ'), 40 | ('y', 'j'), 41 | ('ś', 'ʃ'), 42 | ('ṣ', 's`'), 43 | ('r', 'ɾ'), 44 | ('l̤', 'l`'), 45 | ('h', 'ɦ'), 46 | ("'", ''), 47 | ('~', '^'), 48 | ('ṃ', '^') 49 | ]] 50 | 51 | 52 | def devanagari_to_ipa(text): 53 | text = text.replace('ॐ', 'ओम्') 54 | text = re.sub(r'\s*।\s*$', '.', text) 55 | text = re.sub(r'\s*।\s*', ', ', text) 56 | text = re.sub(r'\s*॥', '.', text) 57 | text = sanscript.transliterate(text, sanscript.DEVANAGARI, sanscript.IAST) 58 | for regex, replacement in _iast_to_ipa: 59 | text = re.sub(regex, replacement, text) 60 | text = re.sub('(.)[`ː]*ḥ', lambda x: x.group(0) 61 | [:-1]+'h'+x.group(1)+'*', text) 62 | return text 63 | -------------------------------------------------------------------------------- /text/shanghainese.py: -------------------------------------------------------------------------------- 1 | import re 2 | import cn2an 3 | import opencc 4 | 5 | 6 | converter = opencc.OpenCC('zaonhe') 7 | 8 | # List of (Latin alphabet, ipa) pairs: 9 | _latin_to_ipa = [(re.compile('%s' % x[0]), x[1]) for x in [ 10 | ('A', 'ᴇ'), 11 | ('B', 'bi'), 12 | ('C', 'si'), 13 | ('D', 'di'), 14 | ('E', 'i'), 15 | ('F', 'ᴇf'), 16 | ('G', 'dʑi'), 17 | ('H', 'ᴇtɕʰ'), 18 | ('I', 'ᴀi'), 19 | ('J', 'dʑᴇ'), 20 | ('K', 'kʰᴇ'), 21 | ('L', 'ᴇl'), 22 | ('M', 'ᴇm'), 23 | ('N', 'ᴇn'), 24 | ('O', 'o'), 25 | ('P', 'pʰi'), 26 | ('Q', 'kʰiu'), 27 | ('R', 'ᴀl'), 28 | ('S', 'ᴇs'), 29 | ('T', 'tʰi'), 30 | ('U', 'ɦiu'), 31 | ('V', 'vi'), 32 | ('W', 'dᴀbɤliu'), 33 | ('X', 'ᴇks'), 34 | ('Y', 'uᴀi'), 35 | ('Z', 'zᴇ') 36 | ]] 37 | 38 | 39 | def _number_to_shanghainese(num): 40 | num = cn2an.an2cn(num).replace('一十','十').replace('二十', '廿').replace('二', '两') 41 | return re.sub(r'((?:^|[^三四五六七八九])十|廿)两', r'\1二', num) 42 | 43 | 44 | def number_to_shanghainese(text): 45 | return re.sub(r'\d+(?:\.?\d+)?', lambda x: _number_to_shanghainese(x.group()), text) 46 | 47 | 48 | def latin_to_ipa(text): 49 | for regex, replacement in _latin_to_ipa: 50 | text = re.sub(regex, replacement, text) 51 | return text 52 | 53 | 54 | def shanghainese_to_ipa(text): 55 | text = number_to_shanghainese(text.upper()) 56 | text = converter.convert(text).replace('-','').replace('$',' ') 57 | text = re.sub(r'[A-Z]', lambda x: latin_to_ipa(x.group())+' ', text) 58 | text = re.sub(r'[、；：]', '，', text) 59 | text = re.sub(r'\s*，\s*', ', ', text) 60 | text = re.sub(r'\s*。\s*', '. ', text) 61 | text = re.sub(r'\s*？\s*', '? ', text) 62 | text = re.sub(r'\s*！\s*', '! ', text) 63 | text = re.sub(r'\s*$', '', text) 64 | return text 65 | -------------------------------------------------------------------------------- /text/symbols.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Defines the set of symbols used in text input to the model. 3 | ''' 4 | 5 | # japanese_cleaners 6 | # _pad = '_' 7 | # _punctuation = ',.!?-' 8 | # _letters = 'AEINOQUabdefghijkmnoprstuvwyzʃʧ↓↑ ' 9 | 10 | 11 | '''# japanese_cleaners2 12 | _pad = '_' 13 | _punctuation = ',.!?-~…' 14 | _letters = 'AEINOQUabdefghijkmnoprstuvwyzʃʧʦ↓↑ ' 15 | ''' 16 | 17 | 18 | '''# korean_cleaners 19 | _pad = '_' 20 | _punctuation = ',.!?…~' 21 | _letters = 'ㄱㄴㄷㄹㅁㅂㅅㅇㅈㅊㅋㅌㅍㅎㄲㄸㅃㅆㅉㅏㅓㅗㅜㅡㅣㅐㅔ ' 22 | ''' 23 | 24 | '''# chinese_cleaners 25 | _pad = '_' 26 | _punctuation = '，。！？—…' 27 | _letters = 'ㄅㄆㄇㄈㄉㄊㄋㄌㄍㄎㄏㄐㄑㄒㄓㄔㄕㄖㄗㄘㄙㄚㄛㄜㄝㄞㄟㄠㄡㄢㄣㄤㄥㄦㄧㄨㄩˉˊˇˋ˙ ' 28 | ''' 29 | 30 | # # zh_ja_mixture_cleaners 31 | # _pad = '_' 32 | # _punctuation = ',.!?-~…' 33 | # _letters = 'AEINOQUabdefghijklmnoprstuvwyzʃʧʦɯɹəɥ⁼ʰ`→↓↑ ' 34 | 35 | 36 | '''# sanskrit_cleaners 37 | _pad = '_' 38 | _punctuation = '।' 39 | _letters = 'ँंःअआइईउऊऋएऐओऔकखगघङचछजझञटठडढणतथदधनपफबभमयरलळवशषसहऽािीुूृॄेैोौ्ॠॢ ' 40 | ''' 41 | 42 | '''# cjks_cleaners 43 | _pad = '_' 44 | _punctuation = ',.!?-~…' 45 | _letters = 'NQabdefghijklmnopstuvwxyzʃʧʥʦɯɹəɥçɸɾβŋɦː⁼ʰ`^#*=→↓↑ ' 46 | ''' 47 | 48 | '''# thai_cleaners 49 | _pad = '_' 50 | _punctuation = '.!? ' 51 | _letters = 'กขฃคฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรฤลวศษสหฬอฮฯะัาำิีึืุูเแโใไๅๆ็่้๊๋์' 52 | ''' 53 | 54 | # # cjke_cleaners2 55 | _pad = '_' 56 | _punctuation = ',.!?-~…' 57 | _letters = 'NQabdefghijklmnopstuvwxyzɑæʃʑçɯɪɔɛɹðəɫɥɸʊɾʒθβŋɦ⁼ʰ`^#*=ˈˌ→↓↑ ' 58 | 59 | 60 | '''# shanghainese_cleaners 61 | _pad = '_' 62 | _punctuation = ',.!?…' 63 | _letters = 'abdfghiklmnopstuvyzøŋȵɑɔɕəɤɦɪɿʑʔʰ̩̃ᴀᴇ15678 ' 64 | ''' 65 | 66 | '''# chinese_dialect_cleaners 67 | _pad = '_' 68 | _punctuation = ',.!?~…─' 69 | _letters = '#Nabdefghijklmnoprstuvwxyzæçøŋœȵɐɑɒɓɔɕɗɘəɚɛɜɣɤɦɪɭɯɵɷɸɻɾɿʂʅʊʋʌʏʑʔʦʮʰʷˀː˥˦˧˨˩̥̩̃̚ᴀᴇ↑↓∅ⱼ ' 70 | ''' 71 | 72 | # Export all symbols: 73 | symbols = [_pad] + list(_punctuation) + list(_letters) 74 | 75 | # Special symbol ids 76 | SPACE_ID = symbols.index(" ") 77 | -------------------------------------------------------------------------------- /text/thai.py: -------------------------------------------------------------------------------- 1 | import re 2 | from num_thai.thainumbers import NumThai 3 | 4 | 5 | num = NumThai() 6 | 7 | # List of (Latin alphabet, Thai) pairs: 8 | _latin_to_thai = [(re.compile('%s' % x[0], re.IGNORECASE), x[1]) for x in [ 9 | ('a', 'เอ'), 10 | ('b','บี'), 11 | ('c','ซี'), 12 | ('d','ดี'), 13 | ('e','อี'), 14 | ('f','เอฟ'), 15 | ('g','จี'), 16 | ('h','เอช'), 17 | ('i','ไอ'), 18 | ('j','เจ'), 19 | ('k','เค'), 20 | ('l','แอล'), 21 | ('m','เอ็ม'), 22 | ('n','เอ็น'), 23 | ('o','โอ'), 24 | ('p','พี'), 25 | ('q','คิว'), 26 | ('r','แอร์'), 27 | ('s','เอส'), 28 | ('t','ที'), 29 | ('u','ยู'), 30 | ('v','วี'), 31 | ('w','ดับเบิลยู'), 32 | ('x','เอ็กซ์'), 33 | ('y','วาย'), 34 | ('z','ซี') 35 | ]] 36 | 37 | 38 | def num_to_thai(text): 39 | return re.sub(r'(?:\d+(?:,?\d+)?)+(?:\.\d+(?:,?\d+)?)?', lambda x: ''.join(num.NumberToTextThai(float(x.group(0).replace(',', '')))), text) 40 | 41 | def latin_to_thai(text): 42 | for regex, replacement in _latin_to_thai: 43 | text = re.sub(regex, replacement, text) 44 | return text 45 | -------------------------------------------------------------------------------- /transforms.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.nn import functional as F 3 | 4 | import numpy as np 5 | 6 | 7 | DEFAULT_MIN_BIN_WIDTH = 1e-3 8 | DEFAULT_MIN_BIN_HEIGHT = 1e-3 9 | DEFAULT_MIN_DERIVATIVE = 1e-3 10 | 11 | 12 | def piecewise_rational_quadratic_transform(inputs, 13 | unnormalized_widths, 14 | unnormalized_heights, 15 | unnormalized_derivatives, 16 | inverse=False, 17 | tails=None, 18 | tail_bound=1., 19 | min_bin_width=DEFAULT_MIN_BIN_WIDTH, 20 | min_bin_height=DEFAULT_MIN_BIN_HEIGHT, 21 | min_derivative=DEFAULT_MIN_DERIVATIVE): 22 | 23 | if tails is None: 24 | spline_fn = rational_quadratic_spline 25 | spline_kwargs = {} 26 | else: 27 | spline_fn = unconstrained_rational_quadratic_spline 28 | spline_kwargs = { 29 | 'tails': tails, 30 | 'tail_bound': tail_bound 31 | } 32 | 33 | outputs, logabsdet = spline_fn( 34 | inputs=inputs, 35 | unnormalized_widths=unnormalized_widths, 36 | unnormalized_heights=unnormalized_heights, 37 | unnormalized_derivatives=unnormalized_derivatives, 38 | inverse=inverse, 39 | min_bin_width=min_bin_width, 40 | min_bin_height=min_bin_height, 41 | min_derivative=min_derivative, 42 | **spline_kwargs 43 | ) 44 | return outputs, logabsdet 45 | 46 | 47 | def searchsorted(bin_locations, inputs, eps=1e-6): 48 | bin_locations[..., -1] += eps 49 | return torch.sum( 50 | inputs[..., None] >= bin_locations, 51 | dim=-1 52 | ) - 1 53 | 54 | 55 | def unconstrained_rational_quadratic_spline(inputs, 56 | unnormalized_widths, 57 | unnormalized_heights, 58 | unnormalized_derivatives, 59 | inverse=False, 60 | tails='linear', 61 | tail_bound=1., 62 | min_bin_width=DEFAULT_MIN_BIN_WIDTH, 63 | min_bin_height=DEFAULT_MIN_BIN_HEIGHT, 64 | min_derivative=DEFAULT_MIN_DERIVATIVE): 65 | inside_interval_mask = (inputs >= -tail_bound) & (inputs <= tail_bound) 66 | outside_interval_mask = ~inside_interval_mask 67 | 68 | outputs = torch.zeros_like(inputs) 69 | logabsdet = torch.zeros_like(inputs) 70 | 71 | if tails == 'linear': 72 | unnormalized_derivatives = F.pad(unnormalized_derivatives, pad=(1, 1)) 73 | constant = np.log(np.exp(1 - min_derivative) - 1) 74 | unnormalized_derivatives[..., 0] = constant 75 | unnormalized_derivatives[..., -1] = constant 76 | 77 | outputs[outside_interval_mask] = inputs[outside_interval_mask] 78 | logabsdet[outside_interval_mask] = 0 79 | else: 80 | raise RuntimeError('{} tails are not implemented.'.format(tails)) 81 | 82 | outputs[inside_interval_mask], logabsdet[inside_interval_mask] = rational_quadratic_spline( 83 | inputs=inputs[inside_interval_mask], 84 | unnormalized_widths=unnormalized_widths[inside_interval_mask, :], 85 | unnormalized_heights=unnormalized_heights[inside_interval_mask, :], 86 | unnormalized_derivatives=unnormalized_derivatives[inside_interval_mask, :], 87 | inverse=inverse, 88 | left=-tail_bound, right=tail_bound, bottom=-tail_bound, top=tail_bound, 89 | min_bin_width=min_bin_width, 90 | min_bin_height=min_bin_height, 91 | min_derivative=min_derivative 92 | ) 93 | 94 | return outputs, logabsdet 95 | 96 | def rational_quadratic_spline(inputs, 97 | unnormalized_widths, 98 | unnormalized_heights, 99 | unnormalized_derivatives, 100 | inverse=False, 101 | left=0., right=1., bottom=0., top=1., 102 | min_bin_width=DEFAULT_MIN_BIN_WIDTH, 103 | min_bin_height=DEFAULT_MIN_BIN_HEIGHT, 104 | min_derivative=DEFAULT_MIN_DERIVATIVE): 105 | if torch.min(inputs) < left or torch.max(inputs) > right: 106 | raise ValueError('Input to a transform is not within its domain') 107 | 108 | num_bins = unnormalized_widths.shape[-1] 109 | 110 | if min_bin_width * num_bins > 1.0: 111 | raise ValueError('Minimal bin width too large for the number of bins') 112 | if min_bin_height * num_bins > 1.0: 113 | raise ValueError('Minimal bin height too large for the number of bins') 114 | 115 | widths = F.softmax(unnormalized_widths, dim=-1) 116 | widths = min_bin_width + (1 - min_bin_width * num_bins) * widths 117 | cumwidths = torch.cumsum(widths, dim=-1) 118 | cumwidths = F.pad(cumwidths, pad=(1, 0), mode='constant', value=0.0) 119 | cumwidths = (right - left) * cumwidths + left 120 | cumwidths[..., 0] = left 121 | cumwidths[..., -1] = right 122 | widths = cumwidths[..., 1:] - cumwidths[..., :-1] 123 | 124 | derivatives = min_derivative + F.softplus(unnormalized_derivatives) 125 | 126 | heights = F.softmax(unnormalized_heights, dim=-1) 127 | heights = min_bin_height + (1 - min_bin_height * num_bins) * heights 128 | cumheights = torch.cumsum(heights, dim=-1) 129 | cumheights = F.pad(cumheights, pad=(1, 0), mode='constant', value=0.0) 130 | cumheights = (top - bottom) * cumheights + bottom 131 | cumheights[..., 0] = bottom 132 | cumheights[..., -1] = top 133 | heights = cumheights[..., 1:] - cumheights[..., :-1] 134 | 135 | if inverse: 136 | bin_idx = searchsorted(cumheights, inputs)[..., None] 137 | else: 138 | bin_idx = searchsorted(cumwidths, inputs)[..., None] 139 | 140 | input_cumwidths = cumwidths.gather(-1, bin_idx)[..., 0] 141 | input_bin_widths = widths.gather(-1, bin_idx)[..., 0] 142 | 143 | input_cumheights = cumheights.gather(-1, bin_idx)[..., 0] 144 | delta = heights / widths 145 | input_delta = delta.gather(-1, bin_idx)[..., 0] 146 | 147 | input_derivatives = derivatives.gather(-1, bin_idx)[..., 0] 148 | input_derivatives_plus_one = derivatives[..., 1:].gather(-1, bin_idx)[..., 0] 149 | 150 | input_heights = heights.gather(-1, bin_idx)[..., 0] 151 | 152 | if inverse: 153 | a = (((inputs - input_cumheights) * (input_derivatives 154 | + input_derivatives_plus_one 155 | - 2 * input_delta) 156 | + input_heights * (input_delta - input_derivatives))) 157 | b = (input_heights * input_derivatives 158 | - (inputs - input_cumheights) * (input_derivatives 159 | + input_derivatives_plus_one 160 | - 2 * input_delta)) 161 | c = - input_delta * (inputs - input_cumheights) 162 | 163 | discriminant = b.pow(2) - 4 * a * c 164 | assert (discriminant >= 0).all() 165 | 166 | root = (2 * c) / (-b - torch.sqrt(discriminant)) 167 | outputs = root * input_bin_widths + input_cumwidths 168 | 169 | theta_one_minus_theta = root * (1 - root) 170 | denominator = input_delta + ((input_derivatives + input_derivatives_plus_one - 2 * input_delta) 171 | * theta_one_minus_theta) 172 | derivative_numerator = input_delta.pow(2) * (input_derivatives_plus_one * root.pow(2) 173 | + 2 * input_delta * theta_one_minus_theta 174 | + input_derivatives * (1 - root).pow(2)) 175 | logabsdet = torch.log(derivative_numerator) - 2 * torch.log(denominator) 176 | 177 | return outputs, -logabsdet 178 | else: 179 | theta = (inputs - input_cumwidths) / input_bin_widths 180 | theta_one_minus_theta = theta * (1 - theta) 181 | 182 | numerator = input_heights * (input_delta * theta.pow(2) 183 | + input_derivatives * theta_one_minus_theta) 184 | denominator = input_delta + ((input_derivatives + input_derivatives_plus_one - 2 * input_delta) 185 | * theta_one_minus_theta) 186 | outputs = input_cumheights + numerator / denominator 187 | 188 | derivative_numerator = input_delta.pow(2) * (input_derivatives_plus_one * theta.pow(2) 189 | + 2 * input_delta * theta_one_minus_theta 190 | + input_derivatives * (1 - theta).pow(2)) 191 | logabsdet = torch.log(derivative_numerator) - 2 * torch.log(denominator) 192 | 193 | return outputs, logabsdet 194 | -------------------------------------------------------------------------------- /utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | import glob 3 | import sys 4 | import argparse 5 | import logging 6 | import json 7 | import subprocess 8 | import numpy as np 9 | from scipy.io.wavfile import read 10 | import torch 11 | import regex as re 12 | 13 | MATPLOTLIB_FLAG = False 14 | 15 | logging.basicConfig(stream=sys.stdout, level=logging.DEBUG) 16 | logger = logging 17 | 18 | 19 | 20 | zh_pattern = re.compile(r'[\u4e00-\u9fa5]') 21 | en_pattern = re.compile(r'[a-zA-Z]') 22 | jp_pattern = re.compile(r'[\u3040-\u30ff\u31f0-\u31ff]') 23 | kr_pattern = re.compile(r'[\uac00-\ud7af\u1100-\u11ff\u3130-\u318f\ua960-\ua97f]') 24 | num_pattern=re.compile(r'[0-9]') 25 | comma=r"(?<=[.。!！?？；;，,、:：'\"‘“”’()（）《》「」~——])" #向前匹配但固定长度 26 | tags={'ZH':'[ZH]','EN':'[EN]','JP':'[JA]','KR':'[KR]'} 27 | 28 | def tag_cjke(text): 29 | '''为中英日韩加tag,中日正则分不开，故先分句分离中日再识别，以应对大部分情况''' 30 | sentences = re.split(r"([.。!！?？；;，,、:：'\"‘“”’()（）【】《》「」~——]+ *(?![0-9]))", text) #分句，排除小数点 31 | sentences.append("") 32 | sentences = ["".join(i) for i in zip(sentences[0::2],sentences[1::2])] 33 | # print(sentences) 34 | prev_lang=None 35 | tagged_text = "" 36 | for s in sentences: 37 | #全为符号跳过 38 | nu = re.sub(r'[\s\p{P}]+', '', s, flags=re.U).strip() 39 | if len(nu)==0: 40 | continue 41 | s = re.sub(r'[()（）《》「」【】‘“”’]+', '', s) 42 | jp=re.findall(jp_pattern, s) 43 | #本句含日语字符判断为日语 44 | if len(jp)>0: 45 | prev_lang,tagged_jke=tag_jke(s,prev_lang) 46 | tagged_text +=tagged_jke 47 | else: 48 | prev_lang,tagged_cke=tag_cke(s,prev_lang) 49 | tagged_text +=tagged_cke 50 | return tagged_text 51 | 52 | def tag_jke(text,prev_sentence=None): 53 | '''为英日韩加tag''' 54 | # 初始化标记变量 55 | tagged_text = "" 56 | prev_lang = None 57 | tagged=0 58 | # 遍历文本 59 | for char in text: 60 | # 判断当前字符属于哪种语言 61 | if jp_pattern.match(char): 62 | lang = "JP" 63 | elif zh_pattern.match(char): 64 | lang = "JP" 65 | elif kr_pattern.match(char): 66 | lang = "KR" 67 | elif en_pattern.match(char): 68 | lang = "EN" 69 | # elif num_pattern.match(char): 70 | # lang = prev_sentence 71 | else: 72 | lang = None 73 | tagged_text += char 74 | continue 75 | # 如果当前语言与上一个语言不同，就添加标记 76 | if lang != prev_lang: 77 | tagged=1 78 | if prev_lang==None: # 开头 79 | tagged_text =tags[lang]+tagged_text 80 | else: 81 | tagged_text =tagged_text+tags[prev_lang]+tags[lang] 82 | 83 | # 重置标记变量 84 | prev_lang = lang 85 | 86 | # 添加当前字符到标记文本中 87 | tagged_text += char 88 | 89 | # 在最后一个语言的结尾添加对应的标记 90 | if prev_lang: 91 | tagged_text += tags[prev_lang] 92 | if not tagged: 93 | prev_lang=prev_sentence 94 | tagged_text =tags[prev_lang]+tagged_text+tags[prev_lang] 95 | 96 | return prev_lang,tagged_text 97 | 98 | def tag_cke(text,prev_sentence=None): 99 | '''为中英韩加tag''' 100 | # 初始化标记变量 101 | tagged_text = "" 102 | prev_lang = None 103 | # 是否全略过未标签 104 | tagged=0 105 | 106 | # 遍历文本 107 | for char in text: 108 | # 判断当前字符属于哪种语言 109 | if zh_pattern.match(char): 110 | lang = "ZH" 111 | elif kr_pattern.match(char): 112 | lang = "KR" 113 | elif en_pattern.match(char): 114 | lang = "EN" 115 | # elif num_pattern.match(char): 116 | # lang = prev_sentence 117 | else: 118 | # 略过 119 | lang = None 120 | tagged_text += char 121 | continue 122 | 123 | # 如果当前语言与上一个语言不同，添加标记 124 | if lang != prev_lang: 125 | tagged=1 126 | if prev_lang==None: # 开头 127 | tagged_text =tags[lang]+tagged_text 128 | else: 129 | tagged_text =tagged_text+tags[prev_lang]+tags[lang] 130 | 131 | # 重置标记变量 132 | prev_lang = lang 133 | 134 | # 添加当前字符到标记文本中 135 | tagged_text += char 136 | 137 | # 在最后一个语言的结尾添加对应的标记 138 | if prev_lang: 139 | tagged_text += tags[prev_lang] 140 | # 未标签则继承上一句标签 141 | if tagged==0: 142 | prev_lang=prev_sentence 143 | tagged_text =tags[prev_lang]+tagged_text+tags[prev_lang] 144 | return prev_lang,tagged_text 145 | 146 | 147 | 148 | def load_checkpoint(checkpoint_path, model, optimizer=None, drop_speaker_emb=False): 149 | assert os.path.isfile(checkpoint_path) 150 | checkpoint_dict = torch.load(checkpoint_path, map_location='cpu') 151 | iteration = checkpoint_dict['iteration'] 152 | learning_rate = checkpoint_dict['learning_rate'] 153 | if optimizer is not None: 154 | optimizer.load_state_dict(checkpoint_dict['optimizer']) 155 | saved_state_dict = checkpoint_dict['model'] 156 | if hasattr(model, 'module'): 157 | state_dict = model.module.state_dict() 158 | else: 159 | state_dict = model.state_dict() 160 | new_state_dict = {} 161 | for k, v in state_dict.items(): 162 | try: 163 | if k == 'emb_g.weight': 164 | if drop_speaker_emb: 165 | new_state_dict[k] = v 166 | continue 167 | v[:saved_state_dict[k].shape[0], :] = saved_state_dict[k] 168 | new_state_dict[k] = v 169 | else: 170 | new_state_dict[k] = saved_state_dict[k] 171 | except: 172 | logger.info("%s is not in the checkpoint" % k) 173 | new_state_dict[k] = v 174 | if hasattr(model, 'module'): 175 | model.module.load_state_dict(new_state_dict) 176 | else: 177 | model.load_state_dict(new_state_dict) 178 | logger.info("Loaded checkpoint '{}' (iteration {})".format( 179 | checkpoint_path, iteration)) 180 | return model, optimizer, learning_rate, iteration 181 | 182 | 183 | def save_checkpoint(model, optimizer, learning_rate, iteration, checkpoint_path): 184 | logger.info("Saving model and optimizer state at iteration {} to {}".format( 185 | iteration, checkpoint_path)) 186 | if hasattr(model, 'module'): 187 | state_dict = model.module.state_dict() 188 | else: 189 | state_dict = model.state_dict() 190 | torch.save({'model': state_dict, 191 | 'iteration': iteration, 192 | 'optimizer': optimizer.state_dict() if optimizer is not None else None, 193 | 'learning_rate': learning_rate}, checkpoint_path) 194 | 195 | 196 | def summarize(writer, global_step, scalars={}, histograms={}, images={}, audios={}, audio_sampling_rate=22050): 197 | for k, v in scalars.items(): 198 | writer.add_scalar(k, v, global_step) 199 | for k, v in histograms.items(): 200 | writer.add_histogram(k, v, global_step) 201 | for k, v in images.items(): 202 | writer.add_image(k, v, global_step, dataformats='HWC') 203 | for k, v in audios.items(): 204 | writer.add_audio(k, v, global_step, audio_sampling_rate) 205 | 206 | 207 | def extract_digits(f): 208 | digits = "".join(filter(str.isdigit, f)) 209 | return int(digits) if digits else -1 210 | 211 | 212 | def latest_checkpoint_path(dir_path, regex="G_[0-9]*.pth"): 213 | f_list = glob.glob(os.path.join(dir_path, regex)) 214 | f_list.sort(key=lambda f: extract_digits(f)) 215 | x = f_list[-1] 216 | print(f"latest_checkpoint_path:{x}") 217 | return x 218 | 219 | 220 | def oldest_checkpoint_path(dir_path, regex="G_[0-9]*.pth", preserved=4): 221 | f_list = glob.glob(os.path.join(dir_path, regex)) 222 | f_list.sort(key=lambda f: extract_digits(f)) 223 | if len(f_list) > preserved: 224 | x = f_list[0] 225 | print(f"oldest_checkpoint_path:{x}") 226 | return x 227 | return "" 228 | 229 | 230 | def plot_spectrogram_to_numpy(spectrogram): 231 | global MATPLOTLIB_FLAG 232 | if not MATPLOTLIB_FLAG: 233 | import matplotlib 234 | matplotlib.use("Agg") 235 | MATPLOTLIB_FLAG = True 236 | mpl_logger = logging.getLogger('matplotlib') 237 | mpl_logger.setLevel(logging.WARNING) 238 | import matplotlib.pylab as plt 239 | import numpy as np 240 | 241 | fig, ax = plt.subplots(figsize=(10, 2)) 242 | im = ax.imshow(spectrogram, aspect="auto", origin="lower", 243 | interpolation='none') 244 | plt.colorbar(im, ax=ax) 245 | plt.xlabel("Frames") 246 | plt.ylabel("Channels") 247 | plt.tight_layout() 248 | 249 | fig.canvas.draw() 250 | data = np.fromstring(fig.canvas.tostring_rgb(), dtype=np.uint8, sep='') 251 | data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,)) 252 | plt.close() 253 | return data 254 | 255 | 256 | def plot_alignment_to_numpy(alignment, info=None): 257 | global MATPLOTLIB_FLAG 258 | if not MATPLOTLIB_FLAG: 259 | import matplotlib 260 | matplotlib.use("Agg") 261 | MATPLOTLIB_FLAG = True 262 | mpl_logger = logging.getLogger('matplotlib') 263 | mpl_logger.setLevel(logging.WARNING) 264 | import matplotlib.pylab as plt 265 | import numpy as np 266 | 267 | fig, ax = plt.subplots(figsize=(6, 4)) 268 | im = ax.imshow(alignment.transpose(), aspect='auto', origin='lower', 269 | interpolation='none') 270 | fig.colorbar(im, ax=ax) 271 | xlabel = 'Decoder timestep' 272 | if info is not None: 273 | xlabel += '\n\n' + info 274 | plt.xlabel(xlabel) 275 | plt.ylabel('Encoder timestep') 276 | plt.tight_layout() 277 | 278 | fig.canvas.draw() 279 | data = np.fromstring(fig.canvas.tostring_rgb(), dtype=np.uint8, sep='') 280 | data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,)) 281 | plt.close() 282 | return data 283 | 284 | 285 | def load_wav_to_torch(full_path): 286 | sampling_rate, data = read(full_path) 287 | return torch.FloatTensor(data.astype(np.float32)), sampling_rate 288 | 289 | 290 | def load_filepaths_and_text(filename, split="|"): 291 | with open(filename, encoding='utf-8') as f: 292 | filepaths_and_text = [line.strip().split(split) for line in f] 293 | return filepaths_and_text 294 | 295 | 296 | def str2bool(v): 297 | if isinstance(v, bool): 298 | return v 299 | if v.lower() in ('yes', 'true', 't', 'y', '1'): 300 | return True 301 | elif v.lower() in ('no', 'false', 'f', 'n', '0'): 302 | return False 303 | else: 304 | raise argparse.ArgumentTypeError('Boolean value expected.') 305 | 306 | 307 | def get_hparams(init=True): 308 | parser = argparse.ArgumentParser() 309 | parser.add_argument('-c', '--config', type=str, default="./configs/modified_finetune_speaker.json", 310 | help='JSON file for configuration') 311 | parser.add_argument('-m', '--model', type=str, default="pretrained_models", 312 | help='Model name') 313 | parser.add_argument('-n', '--max_epochs', type=int, default=50, 314 | help='finetune epochs') 315 | parser.add_argument('--cont', type=str2bool, default=False, help='whether to continue training on the latest checkpoint') 316 | parser.add_argument('--drop_speaker_embed', type=str2bool, default=False, help='whether to drop existing characters') 317 | parser.add_argument('--train_with_pretrained_model', type=str2bool, default=True, 318 | help='whether to train with pretrained model') 319 | parser.add_argument('--preserved', type=int, default=4, 320 | help='Number of preserved models') 321 | 322 | args = parser.parse_args() 323 | model_dir = os.path.join("./", args.model) 324 | 325 | if not os.path.exists(model_dir): 326 | os.makedirs(model_dir) 327 | 328 | config_path = args.config 329 | config_save_path = os.path.join(model_dir, "config.json") 330 | if init: 331 | with open(config_path, "r") as f: 332 | data = f.read() 333 | with open(config_save_path, "w") as f: 334 | f.write(data) 335 | else: 336 | with open(config_save_path, "r") as f: 337 | data = f.read() 338 | config = json.loads(data) 339 | 340 | hparams = HParams(**config) 341 | hparams.model_dir = model_dir 342 | hparams.max_epochs = args.max_epochs 343 | hparams.cont = args.cont 344 | hparams.drop_speaker_embed = args.drop_speaker_embed 345 | hparams.train_with_pretrained_model = args.train_with_pretrained_model 346 | hparams.preserved = args.preserved 347 | return hparams 348 | 349 | 350 | def get_hparams_from_dir(model_dir): 351 | config_save_path = os.path.join(model_dir, "config.json") 352 | with open(config_save_path, "r") as f: 353 | data = f.read() 354 | config = json.loads(data) 355 | 356 | hparams = HParams(**config) 357 | hparams.model_dir = model_dir 358 | return hparams 359 | 360 | 361 | def get_hparams_from_file(config_path): 362 | with open(config_path, "r", encoding="utf-8") as f: 363 | data = f.read() 364 | config = json.loads(data) 365 | 366 | hparams = HParams(**config) 367 | return hparams 368 | 369 | 370 | def check_git_hash(model_dir): 371 | source_dir = os.path.dirname(os.path.realpath(__file__)) 372 | if not os.path.exists(os.path.join(source_dir, ".git")): 373 | logger.warn("{} is not a git repository, therefore hash value comparison will be ignored.".format( 374 | source_dir 375 | )) 376 | return 377 | 378 | cur_hash = subprocess.getoutput("git rev-parse HEAD") 379 | 380 | path = os.path.join(model_dir, "githash") 381 | if os.path.exists(path): 382 | saved_hash = open(path).read() 383 | if saved_hash != cur_hash: 384 | logger.warn("git hash values are different. {}(saved) != {}(current)".format( 385 | saved_hash[:8], cur_hash[:8])) 386 | else: 387 | open(path, "w").write(cur_hash) 388 | 389 | 390 | def get_logger(model_dir, filename="train.log"): 391 | global logger 392 | logger = logging.getLogger(os.path.basename(model_dir)) 393 | logger.setLevel(logging.DEBUG) 394 | 395 | formatter = logging.Formatter("%(asctime)s\t%(name)s\t%(levelname)s\t%(message)s") 396 | if not os.path.exists(model_dir): 397 | os.makedirs(model_dir) 398 | h = logging.FileHandler(os.path.join(model_dir, filename),encoding="utf-8") 399 | h.setLevel(logging.DEBUG) 400 | h.setFormatter(formatter) 401 | logger.addHandler(h) 402 | return logger 403 | 404 | 405 | class HParams(): 406 | def __init__(self, **kwargs): 407 | for k, v in kwargs.items(): 408 | if type(v) == dict: 409 | v = HParams(**v) 410 | self[k] = v 411 | 412 | def keys(self): 413 | return self.__dict__.keys() 414 | 415 | def items(self): 416 | return self.__dict__.items() 417 | 418 | def values(self): 419 | return self.__dict__.values() 420 | 421 | def __len__(self): 422 | return len(self.__dict__) 423 | 424 | def __getitem__(self, key): 425 | return getattr(self, key) 426 | 427 | def __setitem__(self, key, value): 428 | return setattr(self, key, value) 429 | 430 | def __contains__(self, key): 431 | return key in self.__dict__ 432 | 433 | def __repr__(self): 434 | return self.__dict__.__repr__() --------------------------------------------------------------------------------