├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── app.py
├── attentions.py
├── bert
│   ├── ProsodyModel.py
│   ├── __init__.py
│   ├── config.json
│   ├── prosody_tool.py
│   └── vocab.txt
├── commons.py
├── configs
│   ├── bert_vits.json
│   └── bert_vits_student.json
├── data
│   └── 000001-010000.txt
├── data_utils.py
├── filelists
│   ├── all.txt
│   ├── train.txt
│   └── valid.txt
├── losses.py
├── mel_processing.py
├── model_onnx.py
├── model_onnx_stream.py
├── models.py
├── modules.py
├── monotonic_align
│   ├── __init__.py
│   ├── core.pyx
│   └── setup.py
├── requirements.txt
├── text
│   ├── __init__.py
│   ├── pinyin-local.txt
│   └── symbols.py
├── train.py
├── transforms.py
├── utils.py
├── vits_infer.py
├── vits_infer_item.txt
├── vits_infer_no_bert.py
├── vits_infer_onnx.py
├── vits_infer_onnx_stream.py
├── vits_infer_out
│   ├── bert_vits_1.wav
│   ├── bert_vits_2.wav
│   └── bert_vits_3.wav
├── vits_infer_pause.py
├── vits_infer_stream.py
├── vits_pinyin.py
├── vits_prepare.py
└── vits_resample.py
/CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Covenant Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | We as members, contributors, and leaders pledge to make participation in our 6 | community a harassment-free experience for everyone, regardless of age, body 7 | size, visible or invisible disability, ethnicity, sex characteristics, gender 8 | identity and expression, level of experience, education, socio-economic status, 9 | nationality, personal appearance, race, caste, color, religion, or sexual 10 | identity and orientation. 11 | 12 | We pledge to act and interact in ways that contribute to an open, welcoming, 13 | diverse, inclusive, and healthy community. 
14 | 15 | ## Our Standards 16 | 17 | Examples of behavior that contributes to a positive environment for our 18 | community include: 19 | 20 | * Demonstrating empathy and kindness toward other people 21 | * Being respectful of differing opinions, viewpoints, and experiences 22 | * Giving and gracefully accepting constructive feedback 23 | * Accepting responsibility and apologizing to those affected by our mistakes, 24 | and learning from the experience 25 | * Focusing on what is best not just for us as individuals, but for the overall 26 | community 27 | 28 | Examples of unacceptable behavior include: 29 | 30 | * The use of sexualized language or imagery, and sexual attention or advances of 31 | any kind 32 | * Trolling, insulting or derogatory comments, and personal or political attacks 33 | * Public or private harassment 34 | * Publishing others' private information, such as a physical or email address, 35 | without their explicit permission 36 | * Other conduct which could reasonably be considered inappropriate in a 37 | professional setting 38 | 39 | ## Enforcement Responsibilities 40 | 41 | Community leaders are responsible for clarifying and enforcing our standards of 42 | acceptable behavior and will take appropriate and fair corrective action in 43 | response to any behavior that they deem inappropriate, threatening, offensive, 44 | or harmful. 45 | 46 | Community leaders have the right and responsibility to remove, edit, or reject 47 | comments, commits, code, wiki edits, issues, and other contributions that are 48 | not aligned to this Code of Conduct, and will communicate reasons for moderation 49 | decisions when appropriate. 50 | 51 | ## Scope 52 | 53 | This Code of Conduct applies within all community spaces, and also applies when 54 | an individual is officially representing the community in public spaces. 
55 | Examples of representing our community include using an official e-mail address, 56 | posting via an official social media account, or acting as an appointed 57 | representative at an online or offline event. 58 | 59 | ## Enforcement 60 | 61 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 62 | reported to the community leaders responsible for enforcement. 63 | All complaints will be reviewed and investigated promptly and fairly. 64 | 65 | All community leaders are obligated to respect the privacy and security of the 66 | reporter of any incident. 67 | 68 | ## Enforcement Guidelines 69 | 70 | Community leaders will follow these Community Impact Guidelines in determining 71 | the consequences for any action they deem in violation of this Code of Conduct: 72 | 73 | ### 1. Correction 74 | 75 | **Community Impact**: Use of inappropriate language or other behavior deemed 76 | unprofessional or unwelcome in the community. 77 | 78 | **Consequence**: A private, written warning from community leaders, providing 79 | clarity around the nature of the violation and an explanation of why the 80 | behavior was inappropriate. A public apology may be requested. 81 | 82 | ### 2. Warning 83 | 84 | **Community Impact**: A violation through a single incident or series of 85 | actions. 86 | 87 | **Consequence**: A warning with consequences for continued behavior. No 88 | interaction with the people involved, including unsolicited interaction with 89 | those enforcing the Code of Conduct, for a specified period of time. This 90 | includes avoiding interactions in community spaces as well as external channels 91 | like social media. Violating these terms may lead to a temporary or permanent 92 | ban. 93 | 94 | ### 3. Temporary Ban 95 | 96 | **Community Impact**: A serious violation of community standards, including 97 | sustained inappropriate behavior. 
98 | 99 | **Consequence**: A temporary ban from any sort of interaction or public 100 | communication with the community for a specified period of time. No public or 101 | private interaction with the people involved, including unsolicited interaction 102 | with those enforcing the Code of Conduct, is allowed during this period. 103 | Violating these terms may lead to a permanent ban. 104 | 105 | ### 4. Permanent Ban 106 | 107 | **Community Impact**: Demonstrating a pattern of violation of community 108 | standards, including sustained inappropriate behavior, harassment of an 109 | individual, or aggression toward or disparagement of classes of individuals. 110 | 111 | **Consequence**: A permanent ban from any sort of public interaction within the 112 | community. 113 | 114 | ## Attribution 115 | 116 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], 117 | version 2.1, available at 118 | [https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1]. 119 | 120 | Community Impact Guidelines were inspired by 121 | [Mozilla's code of conduct enforcement ladder][Mozilla CoC]. 122 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # How to Contribute 2 | 3 | First off, thanks for taking the time to contribute!👏 4 | 5 | ## Fork the Repository 🍴 6 | 7 | 1. Start by forking the repository. You can do this by clicking the "Fork" button in the upper right corner of the repository page. This will create a copy of the repository in your GitHub account. 8 | 9 | ## Clone Your Fork 📥 10 | 11 | 2. Clone your newly created fork of the repository to your local machine with the following command: 12 | 13 | ```bash 14 | git clone https://github.com/your-username/vits_chinese.git 15 | ``` 16 | 17 | ## Create a New Branch 🌿 18 | 19 | 3. Create a new branch for the specific issue or feature you are working on. 
Use a descriptive branch name: 20 | 21 | ```bash 22 | git checkout -b "branch_name" 23 | ``` 24 | 25 | ## Submitting Changes 🚀 26 | 27 | 4. Make your desired changes to the codebase. 28 | 29 | 5. Stage your changes using the following command: 30 | 31 | ```bash 32 | git add . 33 | ``` 34 | 35 | 6. Commit your changes with a clear and concise commit message: 36 | 37 | ```bash 38 | git commit -m "A brief summary of the commit." 39 | ``` 40 | 41 | ## Push Your Changes 🚢 42 | 43 | 7. Push your local commits to your remote repository: 44 | 45 | ```bash 46 | git push origin "branch_name" 47 | ``` 48 | 49 | ## Create a Pull Request 🌟 50 | 51 | 8. Go to your forked repository on GitHub and click on the "New Pull Request" button. This will open a new pull request to the original repository. 52 | 53 | ## Coding Style 📝 54 | 55 | Start reading the code, and you'll get the hang of it. It is optimized for readability: 56 | 57 | - Variables must be uppercase and should begin with MY\_. 58 | - Functions must be lowercase. 59 | - Check your shell scripts with ShellCheck before submitting. 60 | - Please use tabs to indent. 61 | 62 | One more thing: 63 | 64 | Keep it simple! 👍 65 | 66 | Thanks! 
❤️❤️❤️ 67 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 PlayVoice 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Best-practice TTS based on BERT and VITS, with some of Microsoft's NaturalSpeech features 2 | 3 | [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/maxmax20160403/vits_chinese) 4 | 5 | 6 | 7 | 8 | 9 | ## 这是一个用于TTS算法学习的项目,如果您在寻找直接用于生产的TTS,本项目可能不适合您! 
10 | https://user-images.githubusercontent.com/16432329/220678182-4775dec8-9229-4578-870f-2eebc3a5d660.mp4 11 | 12 | > 天空呈现的透心的蓝,像极了当年。总在这样的时候,透过窗棂,心,在天空里无尽的游弋!柔柔的,浓浓的,痴痴的风,牵引起心底灵动的思潮;情愫悠悠,思情绵绵,风里默坐,红尘中的浅醉,诗词中的优柔,任那自在飞花轻似梦的情怀,裁一束霓衣,织就清浅淡薄的安寂。 13 | > 14 | > 风的影子翻阅过淡蓝色的信笺,柔和的文字浅浅地漫过我安静的眸,一如几朵悠闲的云儿,忽而氤氲成汽,忽而修饰成花,铅华洗尽后的透彻和靓丽,爽爽朗朗,轻轻盈盈 15 | > 16 | > 时光仿佛有穿越到了从前,在你诗情画意的眼波中,在你舒适浪漫的暇思里,我如风中的思绪徜徉广阔天际,仿佛一片沾染了快乐的羽毛,在云环影绕颤动里浸润着风的呼吸,风的诗韵,那清新的耳语,那婉约的甜蜜,那恬淡的温馨,将一腔情澜染得愈发的缠绵。 17 | 18 | ### Features,特性 19 | 1. Hidden prosody embedding from **BERT**, giving natural, grammar-aware pauses 20 | 21 | 2. Infer loss from **NaturalSpeech**, giving fewer pronunciation errors 22 | 23 | 3. Framework of **VITS**, giving high audio quality 24 | 25 | 4. Module-wise distillation, giving faster inference 26 | 27 | :heartpulse:**Tip**: It is recommended to fine-tune with the **Infer Loss** after the base model is trained, and to freeze the **PosteriorEncoder** during fine-tuning. 28 | 29 | :heartpulse:**In other words: during initial training, do not use loss_kl_r; once the base model is trained, add loss_kl_r and continue training, and only a little extra training is needed. If the audio quality degrades, multiply loss_kl_r by a coefficient smaller than 1 to reduce its contribution to the model. When continuing training, you can also try freezing the audio posterior encoder (Posterior Encoder). In short, there are many ways to experiment, so try things out!** 30 | 31 | 
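The staged recipe in the tip above can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper name `combine_losses` and all numeric values are hypothetical, and the real loss terms are computed in losses.py and train.py.

```python
def combine_losses(loss_base, loss_kl_r=None, kl_r_weight=1.0):
    """Hypothetical helper illustrating the two-stage training recipe.

    loss_base   -- sum of the usual VITS training terms
    loss_kl_r   -- backward/infer KL term from NaturalSpeech (None in stage 1)
    kl_r_weight -- scale below 1.0 if loss_kl_r hurts audio quality
    """
    if loss_kl_r is None:
        # Stage 1: base training, loss_kl_r is not used at all.
        return loss_base
    # Stage 2: brief fine-tuning with a (possibly damped) loss_kl_r;
    # the posterior encoder can be frozen during this stage.
    return loss_base + kl_r_weight * loss_kl_r

# Stage 1: base model training
stage1 = combine_losses(2.5)
# Stage 2: fine-tuning, halving the contribution of loss_kl_r
stage2 = combine_losses(2.5, loss_kl_r=0.8, kl_r_weight=0.5)
```

Lowering `kl_r_weight` is simply the "multiply loss_kl_r by a coefficient smaller than 1" trick from the tip.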
32 | 33 | ![naturalspeech](https://github.com/PlayVoice/vits_chinese/assets/16432329/0d7ceb00-f159-40a4-8897-b3f2a3c824d3) 34 | 35 |
36 | 37 | ### 为什么不升级为VITS2 38 | VITS2最重要的改进是将Flow的WaveNet模块使用Transformer替换,而在TTS流式实现中,通常需要用纯CNN替换Transformer。 39 | 40 | ### Online demo,在线体验 41 | https://huggingface.co/spaces/maxmax20160403/vits_chinese 42 | 43 | ### Install,安装依赖和MAS对齐 44 | 45 | > pip install -r requirements.txt 46 | 47 | > cd monotonic_align 48 | 49 | > python setup.py build_ext --inplace 50 | 51 | ### Infer with Pretrained model,用示例模型推理 52 | 53 | Get both files from the release page [vits_chinese/releases/](https://github.com/PlayVoice/vits_chinese/releases/tag/v1.0): 54 | 55 | put [prosody_model.pt](https://github.com/PlayVoice/vits_chinese/releases/tag/v1.0) at ./bert/prosody_model.pt 56 | 57 | put [vits_bert_model.pth](https://github.com/PlayVoice/vits_chinese/releases/tag/v1.0) at ./vits_bert_model.pth 58 | 59 | ``` 60 | python vits_infer.py --config ./configs/bert_vits.json --model vits_bert_model.pth 61 | ``` 62 | 63 | The synthesized waves are written to ./vits_infer_out, have a listen! 64 | 65 | ### Infer with chunk wave streaming out,分块流式推理 66 | 67 | The key parameter is ***hop_frame = ∑decoder.ups.padding*** :heartpulse: 68 | 69 | ``` 70 | python vits_infer_stream.py --config ./configs/bert_vits.json --model vits_bert_model.pth 71 | ``` 72 | 73 | ### Ceiling the duration affects naturalness 74 | So change **w_ceil = torch.ceil(w)** to **w_ceil = torch.ceil(w + 0.35)** 75 | 76 | ### All Thanks To Our Contributors: 77 | 78 | 79 | 80 | 81 | ### Train,训练 82 | Download the Baker dataset from [https://aistudio.baidu.com/datasetdetail/36741](https://aistudio.baidu.com/datasetdetail/36741); more info: https://www.data-baker.com/data/index/TNtts/ 83 | 84 | Change the sample rate of the waves to **16kHz** and put the waves in ./data/waves 85 | 86 | ``` 87 | python vits_resample.py -w [input path]:[./data/Wave/] -o ./data/waves -s 16000 88 | ``` 89 | 90 | Put 000001-010000.txt at ./data/000001-010000.txt 91 | 92 | ``` 93 | python vits_prepare.py -c ./configs/bert_vits.json 94 | ``` 95 | 96 | ``` 97 | python train.py -c configs/bert_vits.json -m bert_vits 98 | 
``` 99 | 100 | ![bert_lose](https://user-images.githubusercontent.com/16432329/220883346-c382bea2-1d2f-4a16-b797-2f9e2d2fb639.png) 101 | 102 | ### 额外说明 103 | 104 | 原始标注为 105 | ``` c 106 | 000001 卡尔普#2陪外孙#1玩滑梯#4。 107 | ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1 108 | 000002 假语村言#2别再#1拥抱我#4。 109 | jia2 yu3 cun1 yan2 bie2 zai4 yong1 bao4 wo3 110 | ``` 111 | 112 | 标注规整后: 113 | - BERT需要汉字 `卡尔普陪外孙玩滑梯。` (包括标点) 114 | - TTS需要声韵母 `sil k a2 ^ er2 p u3 p ei2 ^ uai4 s uen1 ^ uan2 h ua2 t i1 sp sil` 115 | ``` c 116 | 000001 卡尔普陪外孙玩滑梯。 117 | ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1 118 | sil k a2 ^ er2 p u3 p ei2 ^ uai4 s uen1 ^ uan2 h ua2 t i1 sp sil 119 | 000002 假语村言别再拥抱我。 120 | jia2 yu3 cun1 yan2 bie2 zai4 yong1 bao4 wo3 121 | sil j ia2 ^ v3 c uen1 ^ ian2 b ie2 z ai4 ^ iong1 b ao4 ^ uo3 sp sil 122 | ``` 123 | 124 | 训练标注为 125 | ``` 126 | ./data/wavs/000001.wav|./data/temps/000001.spec.pt|./data/berts/000001.npy|sil k a2 ^ er2 p u3 p ei2 ^ uai4 s uen1 ^ uan2 h ua2 t i1 sp sil 127 | ./data/wavs/000002.wav|./data/temps/000002.spec.pt|./data/berts/000002.npy|sil j ia2 ^ v3 c uen1 ^ ian2 b ie2 z ai4 ^ iong1 b ao4 ^ uo3 sp sil 128 | ``` 129 | 130 | 遇到这句话会出错 131 | ``` 132 | 002365 这图#2难不成#2是#1P过的#4? 
zhe4 tu2 nan2 bu4 cheng2 shi4 P IY1 guo4 de5 134 | ``` 135 | 136 | ### 拼音错误修改 137 | 将正确的词语和拼音写入文件: [./text/pinyin-local.txt](./text/pinyin-local.txt) 138 | ``` 139 | 渐渐 jian4 jian4 140 | 浅浅 qian3 qian3 141 | ``` 142 | 143 | ### 数字播报支持 144 | 已支持,基于WeNet开源社区[WeTextProcessing](https://github.com/wenet-e2e/WeTextProcessing);当然,不可能是完美的 145 | 146 | ### 不使用Bert也能推理 147 | ``` 148 | python vits_infer_no_bert.py --config ./configs/bert_vits.json --model vits_bert_model.pth 149 | ``` 150 | 虽然训练使用了Bert,但推理可以完全不用Bert,牺牲自然停顿来适配低计算资源设备,比如手机 151 | 152 | 低资源设备通常会分句合成,这样牺牲的自然停顿也没那么明显 153 | 154 | ### ONNX非流式 155 | 导出:会有许多警告,直接忽略 156 | ``` 157 | python model_onnx.py --config configs/bert_vits.json --model vits_bert_model.pth 158 | ``` 159 | 推理 160 | ``` 161 | python vits_infer_onnx.py --model vits-chinese.onnx 162 | ``` 163 | 164 | ### ONNX流式 165 | 166 | 具体实现,将VITS拆解为两个模型,取名为Encoder和Decoder。 167 | 168 | - Encoder包括TextEncoder与DurationPredictor等; 169 | 170 | - Decoder包括ResidualCouplingBlock与Generator等; 171 | 172 | - ResidualCouplingBlock,即Flow,可以放在Encoder或Decoder,放在Decoder需要更大的**hop_frame** 173 | 174 | 并且将推理逻辑也进行了切分;特别的,先验分布的采样过程放在了Encoder中: 175 | ``` 176 | z_p = m_p + torch.randn_like(m_p) * torch.exp(logs_p) * noise_scale 177 | ``` 178 | 179 | ONNX流式模型导出 180 | ``` 181 | python model_onnx_stream.py --config configs/bert_vits.json --model vits_bert_model.pth 182 | ``` 183 | 184 | ONNX流式模型推理 185 | ``` 186 | python vits_infer_onnx_stream.py --encoder vits-chinese-encoder.onnx --decoder vits-chinese-decoder.onnx 187 | ``` 188 | 189 | 在流式推理中,**hop_frame**是一个重要参数,需要去尝试出合适的值 190 | 191 | ### Model compression based on knowledge distillation,应该叫迁移学习还是知识蒸馏呢? 192 | The student model is 53M in size and runs 3× faster than the teacher model. 
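The module-wise distillation idea borrowed from Nix-TTS can be sketched as matching each student module's output against its frozen teacher counterpart. The sketch below is purely illustrative: the module names and weights are hypothetical, and the actual objective is configured via configs/bert_vits_student.json and train.py.

```python
def mse(teacher, student):
    """Mean squared error between two equally sized flat output vectors."""
    assert len(teacher) == len(student)
    return sum((t - s) ** 2 for t, s in zip(teacher, student)) / len(teacher)

def module_wise_distill_loss(teacher_outs, student_outs, weights=None):
    """Sum per-module reconstruction losses (module names are hypothetical).

    teacher_outs / student_outs: dict mapping a module name to its
    flattened output, e.g. {"text_encoder": [...], "decoder": [...]}.
    """
    weights = weights or {}
    return sum(
        weights.get(name, 1.0) * mse(teacher_outs[name], student_outs[name])
        for name in teacher_outs
    )

teacher = {"text_encoder": [0.2, 0.4], "decoder": [1.0, -1.0]}
student = {"text_encoder": [0.2, 0.4], "decoder": [0.5, -0.5]}
loss = module_wise_distill_loss(teacher, student)  # only the decoder differs
```

In a real run each module's output would be a tensor produced on the same batch by the teacher and the smaller student, with the teacher's weights frozen.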
193 | 194 | To train: 195 | 196 | ``` 197 | python train.py -c configs/bert_vits_student.json -m bert_vits_student 198 | ``` 199 | 200 | To infer, get the [student model](https://github.com/PlayVoice/vits_chinese/releases/tag/v2.0) from the release page 201 | 202 | ``` 203 | python vits_infer.py --config ./configs/bert_vits_student.json --model vits_bert_student.pth 204 | ``` 205 | 206 | ### 代码来源 207 | [Microsoft's NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality](https://arxiv.org/abs/2205.04421) 208 | 209 | [Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation](https://arxiv.org/abs/2203.15643) 210 | 211 | https://github.com/Executedone/Chinese-FastSpeech2 **bert prosody** 212 | 213 | https://github.com/wenet-e2e/WeTextProcessing 214 | 215 | [https://github.com/TensorSpeech/TensorFlowTTS](https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/processor/baker.py) **heavily depends on** 216 | 217 | https://github.com/jaywalnut310/vits 218 | 219 | https://github.com/wenet-e2e/wetts 220 | 221 | https://github.com/csukuangfj **onnx and android** 222 | 223 | ### BERT应用于TTS 224 | 2019 BERT+Tacotron2: Pre-trained text embeddings for enhanced text-to-speech synthesis. 225 | 226 | 2020 BERT+Tacotron2-MultiSpeaker: Improving prosody with linguistic and BERT-derived features in multi-speaker based Mandarin Chinese neural TTS. 227 | 228 | 2021 BERT+Tacotron2: Extracting and predicting word-level style variations for speech synthesis. 229 | 230 | 2022 https://github.com/Executedone/Chinese-FastSpeech2 231 | 232 | 2023 BERT+VISINGER: Towards Improving the Expressiveness of Singing Voice Synthesis with BERT Derived Semantic Information. 
233 | 234 | # AISHELL3多发音人训练,训练出的模型可用于克隆 235 | 切换代码分支[bert_vits_aishell3](https://github.com/PlayVoice/vits_chinese/tree/bert_vits_aishell3),对比分支代码可以看到**针对多发音人所做出的修改** 236 | 237 | ## 数据下载 238 | http://www.openslr.org/93/ 239 | 240 | ## 采样率转换 241 | ``` 242 | python prep_resample.py --wav aishell-3/train/wav/ --out vits_data/waves-16k 243 | ``` 244 | 245 | ## 标注规范化(labels.txt,名称不能改) 246 | ``` 247 | python prep_format_label.py --txt aishell-3/train/content.txt --out vits_data/labels.txt 248 | ``` 249 | 250 | - 原始标注 251 | ``` 252 | SSB00050001.wav 广 guang3 州 zhou1 女 nv3 大 da4 学 xue2 生 sheng1 登 deng1 山 shan1 失 shi1 联 lian2 四 si4 天 tian1 警 jing3 方 fang1 找 zhao3 到 dao4 疑 yi2 似 si4 女 nv3 尸 shi1 253 | SSB00050002.wav 尊 zhun1 重 zhong4 科 ke1 学 xue2 规 gui1 律 lv4 的 de5 要 yao1 求 qiu2 254 | SSB00050003.wav 七 qi1 路 lu4 无 wu2 人 ren2 售 shou4 票 piao4 255 | ``` 256 | - 规范标注 257 | ``` 258 | SSB00050001.wav 广州女大学生登山失联四天警方找到疑似女尸 259 | guang3 zhou1 nv3 da4 xue2 sheng1 deng1 shan1 shi1 lian2 si4 tian1 jing3 fang1 zhao3 dao4 yi2 si4 nv3 shi1 260 | SSB00050002.wav 尊重科学规律的要求 261 | zhun1 zhong4 ke1 xue2 gui1 lv4 de5 yao1 qiu2 262 | SSB00050003.wav 七路无人售票 263 | qi1 lu4 wu2 ren2 shou4 piao4 264 | ``` 265 | ## 数据预处理 266 | ``` 267 | python prep_bert.py --conf configs/bert_vits.json --data vits_data/ 268 | ``` 269 | 270 | 打印信息,在过滤本项目不支持的**儿化音** 271 | 272 | 生成 vits_data/speakers.txt 273 | ``` 274 | {'SSB0005': 0, 'SSB0009': 1, 'SSB0011': 2..., 'SSB1956': 173} 275 | ``` 276 | 生成 filelists 277 | ``` 278 | 0|vits_data/waves-16k/SSB0005/SSB00050001.wav|vits_data/temps/SSB0005/SSB00050001.spec.pt|vits_data/berts/SSB0005/SSB00050001.npy|sil g uang3 zh ou1 n v3 d a4 x ve2 sh eng1 d eng1 sh an1 sh iii1 l ian2 s ii4 t ian1 j ing3 f ang1 zh ao3 d ao4 ^ i2 s ii4 n v3 sh iii1 sil 279 | 0|vits_data/waves-16k/SSB0005/SSB00050002.wav|vits_data/temps/SSB0005/SSB00050002.spec.pt|vits_data/berts/SSB0005/SSB00050002.npy|sil zh uen1 zh ong4 k e1 x ve2 g uei1 l v4 d e5 ^ iao1 q iou2 sil 280 | 
0|vits_data/waves-16k/SSB0005/SSB00050004.wav|vits_data/temps/SSB0005/SSB00050004.spec.pt|vits_data/berts/SSB0005/SSB00050004.npy|sil h ei1 k e4 x van1 b u4 zh iii3 ^ iao4 b o1 d a2 m ou3 ^ i2 g e4 d ian4 h ua4 sil 281 | ``` 282 | ## 数据调试 283 | ``` 284 | python prep_debug.py 285 | ``` 286 | 287 | ## 启动训练 288 | 289 | ``` 290 | cd monotonic_align 291 | 292 | python setup.py build_ext --inplace 293 | 294 | cd - 295 | 296 | python train.py -c configs/bert_vits.json -m bert_vits 297 | ``` 298 | 299 | ## 下载权重 300 | AISHELL3_G.pth:https://github.com/PlayVoice/vits_chinese/releases/v4.0 301 | 302 | ## 推理测试 303 | ``` 304 | python vits_infer.py -c configs/bert_vits.json -m AISHELL3_G.pth -i 0 305 | ``` 306 | -i 为发音人序号,取值范围:0 ~ 173 307 | 308 | **AISHELL3训练数据都是短短的一句话,所以,推理语句中不能有标点** 309 | 310 | ## 训练的AISHELL3模型,使用小米K2社区开源的AISHELL3模型来初始化训练权重,以节约训练时间 311 | 312 | K2开源模型 https://huggingface.co/jackyqs/vits-aishell3-175-chinese/tree/main 下载模型 313 | 314 | K2在线试用 https://huggingface.co/spaces/k2-fsa/text-to-speech 315 | -------------------------------------------------------------------------------- /app.py: -------------------------------------------------------------------------------- 1 | from models import SynthesizerTrn 2 | from vits_pinyin import VITS_PinYin 3 | from text import cleaned_text_to_sequence 4 | from text.symbols import symbols 5 | import gradio as gr 6 | import utils 7 | import torch 8 | import argparse 9 | import os 10 | import re 11 | import logging 12 | 13 | logging.getLogger('numba').setLevel(logging.WARNING) 14 | limitation = os.getenv("SYSTEM") == "spaces" 15 | 16 | 17 | def create_calback(net_g: SynthesizerTrn, tts_front: VITS_PinYin): 18 | def tts_calback(text, dur_scale): 19 | if limitation: 20 | text_len = len(re.sub("\[([A-Z]{2})\]", "", text)) 21 | max_len = 150 22 | if text_len > max_len: 23 | return "Error: Text is too long", None 24 | 25 | phonemes, char_embeds = tts_front.chinese_to_phonemes(text) 26 | input_ids = cleaned_text_to_sequence(phonemes) 
27 | with torch.no_grad(): 28 | x_tst = torch.LongTensor(input_ids).unsqueeze(0).to(device) 29 | x_tst_lengths = torch.LongTensor([len(input_ids)]).to(device) 30 | x_tst_prosody = torch.FloatTensor( 31 | char_embeds).unsqueeze(0).to(device) 32 | audio = net_g.infer(x_tst, x_tst_lengths, x_tst_prosody, noise_scale=0.5, 33 | length_scale=dur_scale)[0][0, 0].data.cpu().float().numpy() 34 | del x_tst, x_tst_lengths, x_tst_prosody 35 | return "Success", (16000, audio) 36 | 37 | return tts_calback 38 | 39 | 40 | example = [['天空呈现的透心的蓝,像极了当年。总在这样的时候,透过窗棂,心,在天空里无尽的游弋!柔柔的,浓浓的,痴痴的风,牵引起心底灵动的思潮;情愫悠悠,思情绵绵,风里默坐,红尘中的浅醉,诗词中的优柔,任那自在飞花轻似梦的情怀,裁一束霓衣,织就清浅淡薄的安寂。', 1], 41 | ['风的影子翻阅过淡蓝色的信笺,柔和的文字浅浅地漫过我安静的眸,一如几朵悠闲的云儿,忽而氤氲成汽,忽而修饰成花,铅华洗尽后的透彻和靓丽,爽爽朗朗,轻轻盈盈', 1], 42 | ['时光仿佛有穿越到了从前,在你诗情画意的眼波中,在你舒适浪漫的暇思里,我如风中的思绪徜徉广阔天际,仿佛一片沾染了快乐的羽毛,在云环影绕颤动里浸润着风的呼吸,风的诗韵,那清新的耳语,那婉约的甜蜜,那恬淡的温馨,将一腔情澜染得愈发的缠绵。', 1],] 43 | 44 | 45 | if __name__ == "__main__": 46 | parser = argparse.ArgumentParser() 47 | parser.add_argument("--share", action="store_true", 48 | default=False, help="share gradio app") 49 | args = parser.parse_args() 50 | 51 | device = torch.device("cpu") 52 | 53 | # pinyin 54 | tts_front = VITS_PinYin("./bert", device) 55 | 56 | # config 57 | hps = utils.get_hparams_from_file("./configs/bert_vits.json") 58 | 59 | # model 60 | net_g = SynthesizerTrn( 61 | len(symbols), 62 | hps.data.filter_length // 2 + 1, 63 | hps.train.segment_size // hps.data.hop_length, 64 | **hps.model) 65 | 66 | model_path = "vits_bert_model.pth" 67 | utils.load_model(model_path, net_g) 68 | net_g.eval() 69 | net_g.to(device) 70 | 71 | tts_calback = create_calback(net_g, tts_front) 72 | 73 | app = gr.Blocks() 74 | with app: 75 | gr.Markdown("# Best TTS based on BERT and VITS with some Natural Speech Features Of Microsoft\n\n" 76 | "code : github.com/PlayVoice/vits_chinese\n\n" 77 | "1, Hidden prosody embedding from BERT,get natural pauses in grammar\n\n" 78 | "2, Infer loss from NaturalSpeech,get less sound error\n\n" 79 | "3, 
Framework of VITS,get high audio quality\n\n" 80 | "