├── requirements.txt ├── sovits4_for_colab.ipynb ├── LICENSE ├── README_zh_CN.md └── README.md /requirements.txt: -------------------------------------------------------------------------------- 1 | # python=3.8 2 | 3 | fastapi==0.84.0 4 | ffmpeg-python==0.2.0 5 | Flask==2.1.2 6 | Flask_Cors==3.0.10 7 | gradio==3.41.2 8 | numpy==1.23.5 9 | pyworld==0.3.0 10 | scipy==1.10.0 11 | SoundFile==0.12.1 12 | torch==2.0.1 13 | torchaudio==2.0.2 14 | torchcrepe==0.0.23 15 | tqdm==4.63.0 16 | rich==13.7.1 17 | loguru==0.7.2 18 | scikit-maad==1.3.12 19 | praat-parselmouth==0.4.4 20 | onnx==1.16.2 21 | onnxsim== 0.4.36 22 | onnxoptimizer==0.3.13 23 | fairseq==0.12.2 24 | librosa==0.9.1 25 | tensorboard==2.12.0 26 | tensorboardX==2.6.2.2 27 | transformers==4.44.2 28 | edge_tts==6.1.12 29 | langdetect==1.0.9 30 | pydantic==1.10.12 31 | pyyaml==6.0.2 32 | pynvml==11.5.3 33 | faiss-cpu==1.8.0.post1 34 | einops==0.8.0 35 | local_attention==1.9.15 -------------------------------------------------------------------------------- /sovits4_for_colab.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "2q0l56aFQhAM" 7 | }, 8 | "source": [ 9 | "# Terms of Use\n", 10 | "\n", 11 | "### Please solve the authorization problem of the dataset on your own. You shall be solely responsible for any problems caused by the use of non-authorized datasets for training and all consequences thereof.The repository and its maintainer, svc develop team, have nothing to do with the consequences!\n", 12 | "\n", 13 | "1. This project is established for academic exchange purposes only and is intended for communication and learning purposes. It is not intended for production environments.\n", 14 | "2. Any videos based on sovits that are published on video platforms must clearly indicate in the description that they are used for voice changing and specify the input source of the voice or audio, for example, using videos or audios published by others and separating the vocals as input source for conversion, which must provide clear original video or music links. If your own voice or other synthesized voices from other commercial vocal synthesis software are used as the input source for conversion, you must also explain it in the description.\n", 15 | "3. You shall be solely responsible for any infringement problems caused by the input source. When using other commercial vocal synthesis software as input source, please ensure that you comply with the terms of use of the software. Note that many vocal synthesis engines clearly state in their terms of use that they cannot be used for input source conversion.\n", 16 | "4. Continuing to use this project is deemed as agreeing to the relevant provisions stated in this repository README. This repository README has the obligation to persuade, and is not responsible for any subsequent problems that may arise.\n", 17 | "5. If you distribute this repository's code or publish any results produced by this project publicly (including but not limited to video sharing platforms), please indicate the original author and code source (this repository).\n", 18 | "6. If you use this project for any other plan, please contact and inform the author of this repository in advance. 
Thank you very much.\n" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": { 24 | "id": "M_RcDbVPhivj" 25 | }, 26 | "source": [ 27 | "## **Note:**\n", 28 | "### **Make sure there is no directory named `sovits4data` in your google drive the first time you use this notebook.**\n", 29 | "### **It will be created to store some necessary files.**\n", 30 | "### **Of course, you can change it to another directory by modifying the `sovits_data_dir` variable.**" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": { 36 | "id": "fHaw6hGEa_Nk" 37 | }, 38 | "source": [ 39 | "# **Initialize environment**" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": { 46 | "colab": { 47 | "base_uri": "https://localhost:8080/" 48 | }, 49 | "executionInfo": { 50 | "elapsed": 380, 51 | "status": "ok", 52 | "timestamp": 1739450150037, 53 | "user": { 54 | "displayName": "Driver Old (Sucial丶)", 55 | "userId": "17161741792330272503" 56 | }, 57 | "user_tz": -480 58 | }, 59 | "id": "0gQcIZ8RsOkn", 60 | "outputId": "9bb3b208-e3ad-4dd7-a80d-3347039944af" 61 | }, 62 | "outputs": [], 63 | "source": [ 64 | "#@title Connect to colab runtime and check GPU\n", 65 | "#@markdown # Connect to colab runtime and check GPU\n", 66 | "#@markdown\n", 67 | "!nvidia-smi" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": null, 73 | "metadata": { 74 | "id": "b1K5WrIIQwad" 75 | }, 76 | "outputs": [], 77 | "source": [ 78 | "#@title Clone repository and install requirements\n", 79 | "#@markdown # Clone repository and install requirements\n", 80 | "#@markdown\n", 81 | "#@markdown ### After the execution is completed, the runtime will **automatically restart**\n", 82 | "#@markdown\n", 83 | "\n", 84 | "!sudo apt-get update -y\n", 85 | "!sudo apt-get install python3.8\n", 86 | "!sudo apt-get install python3.8-distutils\n", 87 | "!sudo apt-get install python3.8-venv python3.8-dev\n", 88 | "!wget https://bootstrap.pypa.io/get-pip.py\n", 89 | "!python3.8 get-pip.py\n", 90 | "!python3.8 --version\n", 91 | "\n", 92 | "!git clone https://github.com/svc-develop-team/so-vits-svc -b 4.1-Stable\n", 93 | "!git clone https://github.com/SUC-DriverOld/so-vits-svc-Deployment-Documents\n", 94 | "!cp /content/so-vits-svc-Deployment-Documents/requirements.txt /content/so-vits-svc\n", 95 | "\n", 96 | "%cd /content/so-vits-svc\n", 97 | "!python3.8 -m pip install --upgrade pip==24.0 setuptools\n", 98 | "!python3.8 -m pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu117" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "metadata": { 105 | "colab": { 106 | "base_uri": "https://localhost:8080/" 107 | }, 108 | "executionInfo": { 109 | "elapsed": 21539, 110 | "status": "ok", 111 | "timestamp": 1713783822861, 112 | "user": { 113 | "displayName": "Driver Old", 114 | "userId": "17161741792330272503" 115 | }, 116 | "user_tz": -480 117 | }, 118 | "id": "wmUkpUmfn_Hs", 119 | "outputId": "d344f369-1016-4585-e1de-af7cf5b11154" 120 | }, 121 | "outputs": [ 122 | { 123 | "name": "stdout", 124 | "output_type": "stream", 125 | "text": [ 126 | "Mounted at /content/drive\n" 127 | ] 128 | } 129 | ], 130 | "source": [ 131 | "#@title Mount google drive and select which directories to sync with google drive\n", 132 | "#@markdown # Mount google drive and select which directories to sync with google drive\n", 133 | "#@markdown\n", 134 | "\n", 135 | "from google.colab import drive\n", 136 | 
"drive.mount(\"/content/drive\")\n", 137 | "\n", 138 | "#@markdown Directory to store **necessary files**, dont miss the slash at the end👇.\n", 139 | "sovits_data_dir = \"/content/drive/MyDrive/sovits4data/\" #@param {type:\"string\"}\n", 140 | "#@markdown By default it will create a `sovits4data/` folder in your google drive.\n", 141 | "RAW_DIR = sovits_data_dir + \"raw/\"\n", 142 | "RESULTS_DIR = sovits_data_dir + \"results/\"\n", 143 | "FILELISTS_DIR = sovits_data_dir + \"filelists/\"\n", 144 | "CONFIGS_DIR = sovits_data_dir + \"configs/\"\n", 145 | "LOGS_DIR = sovits_data_dir + \"logs/44k/\"\n", 146 | "\n", 147 | "#@markdown\n", 148 | "#@markdown ### These folders will be synced with your google drvie\n", 149 | "#@markdown ### **Strongly recommend to check all.**\n", 150 | "#@markdown Sync **input audios** and **output audios**\n", 151 | "sync_raw_and_results = True #@param {type:\"boolean\"}\n", 152 | "if sync_raw_and_results:\n", 153 | " !mkdir -p {RAW_DIR}\n", 154 | " !mkdir -p {RESULTS_DIR}\n", 155 | " !rm -rf /content/so-vits-svc/raw\n", 156 | " !rm -rf /content/so-vits-svc/results\n", 157 | " !ln -s {RAW_DIR} /content/so-vits-svc/raw\n", 158 | " !ln -s {RESULTS_DIR} /content/so-vits-svc/results\n", 159 | "\n", 160 | "#@markdown Sync **config** and **models**\n", 161 | "sync_configs_and_logs = True #@param {type:\"boolean\"}\n", 162 | "if sync_configs_and_logs:\n", 163 | " !mkdir -p {FILELISTS_DIR}\n", 164 | " !mkdir -p {CONFIGS_DIR}\n", 165 | " !mkdir -p {LOGS_DIR}\n", 166 | " !rm -rf /content/so-vits-svc/filelists\n", 167 | " !rm -rf /content/so-vits-svc/configs\n", 168 | " !rm -rf /content/so-vits-svc/logs/44k\n", 169 | " !ln -s {FILELISTS_DIR} /content/so-vits-svc/filelists\n", 170 | " !ln -s {CONFIGS_DIR} /content/so-vits-svc/configs\n", 171 | " !ln -s {LOGS_DIR} /content/so-vits-svc/logs/44k" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": null, 177 | "metadata": { 178 | "colab": { 179 | "base_uri": "https://localhost:8080/" 180 | }, 181 | "executionInfo": { 182 | "elapsed": 2873, 183 | "status": "ok", 184 | "timestamp": 1739517582450, 185 | "user": { 186 | "displayName": "Driver Old (Sucial丶)", 187 | "userId": "17161741792330272503" 188 | }, 189 | "user_tz": -480 190 | }, 191 | "id": "G_PMPCN6wvgZ", 192 | "outputId": "c4c04025-4cda-4c3f-df63-fbd87f259627" 193 | }, 194 | "outputs": [], 195 | "source": [ 196 | "#@title Get pretrained model(Optional but strongly recommend).\n", 197 | "#@markdown # Get pretrained model(Optional but strongly recommend).\n", 198 | "#@markdown\n", 199 | "\n", 200 | "pretrain_model_types = \"vec768l12\" #@param [\"vec768l12_baicai_20250211\", \"vec768l12_vol_emb\", \"vec256l9_no_diffusion\", \"whisper-ppg\", \"hubertsoft\", \"tiny_vec768l12_vol_emb\"]\n", 201 | "\n", 202 | "if pretrain_model_types == \"vec768l12\":\n", 203 | " D_0_URL = \"https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/resolve/main/vec768l12/D_0.pth\"\n", 204 | " G_0_URL = \"https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/resolve/main/vec768l12/G_0.pth\"\n", 205 | " diff_model_URL = \"https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/resolve/main/diffusion/768l12/model_0.pt\"\n", 206 | "elif pretrain_model_types == \"vec768l12_baicai_20250211\":\n", 207 | " D_0_URL = \"https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/resolve/main/vec768l12/baicai_20250211/D_0.pth\"\n", 208 | " G_0_URL = \"https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/resolve/main/vec768l12/baicai_20250211/G_0.pth\"\n", 209 | " 
diff_model_URL = \"https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/resolve/main/diffusion/768l12/baicai_20250211/model_0.pt\"\n", 210 | "elif pretrain_model_types == \"vec768l12_vol_emb\":\n", 211 | " D_0_URL = \"https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/resolve/main/vec768l12/vol_emb/D_0.pth\"\n", 212 | " G_0_URL = \"https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/resolve/main/vec768l12/vol_emb/G_0.pth\"\n", 213 | " diff_model_URL = \"https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/resolve/main/diffusion/768l12/model_0.pt\"\n", 214 | "elif pretrain_model_types == \"vec256l9_no_diffusion\":\n", 215 | " D_0_URL = \"https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/resolve/main/vec256l9/D_0.pth\"\n", 216 | " G_0_URL = \"https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/resolve/main/vec256l9/G_0.pth\"\n", 217 | " diff_model_URL = \"\"\n", 218 | "elif pretrain_model_types == \"whisper-ppg\":\n", 219 | " D_0_URL = \"https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/resolve/main/whisper-ppg/D_0.pth\"\n", 220 | " G_0_URL = \"https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/resolve/main/whisper-ppg/G_0.pth\"\n", 221 | " diff_model_URL = \"https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/resolve/main/diffusion/whisper-ppg/model_0.pt\"\n", 222 | "elif pretrain_model_types == \"hubertsoft\":\n", 223 | " D_0_URL = \"https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/resolve/main/hubertsoft/D_0.pth\"\n", 224 | " G_0_URL = \"https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/resolve/main/hubertsoft/G_0.pth\"\n", 225 | " diff_model_URL = \"https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/resolve/main/diffusion/hubertsoft/model_0.pt\"\n", 226 | "\n", 227 | "%cd /content/so-vits-svc\n", 228 | "\n", 229 | "if D_0_URL != \"\" and G_0_URL != \"\":\n", 230 | " !curl -L {D_0_URL} -o logs/44k/D_0.pth\n", 231 | " !curl -L {G_0_URL} -o logs/44k/G_0.pth\n", 232 | "\n", 233 | "if diff_model_URL != \"\":\n", 234 | " !mkdir -p logs/44k/diffusion\n", 235 | " !curl -L {diff_model_URL} -o logs/44k/diffusion/model_0.pt" 236 | ] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "metadata": { 241 | "id": "k1qadJBFehMo" 242 | }, 243 | "source": [ 244 | "# **Dataset preprocessing**" 245 | ] 246 | }, 247 | { 248 | "cell_type": "markdown", 249 | "metadata": { 250 | "id": "kBlju6Q3lSM6" 251 | }, 252 | "source": [ 253 | "Pack and upload your raw dataset(dataset_raw/) to your google drive.\n", 254 | "\n", 255 | "Makesure the file structure in your zip file looks like this:\n", 256 | "\n", 257 | "```\n", 258 | "YourZIPforSingleSpeakers.zip\n", 259 | "└───speaker\n", 260 | " ├───xxx1-xxx1.wav\n", 261 | " ├───...\n", 262 | " └───Lxx-0xx8.wav\n", 263 | "```\n", 264 | "\n", 265 | "```\n", 266 | "YourZIPforMultipleSpeakers.zip\n", 267 | "├───speaker0\n", 268 | "│ ├───xxx1-xxx1.wav\n", 269 | "│ ├───...\n", 270 | "│ └───Lxx-0xx8.wav\n", 271 | "└───speaker1\n", 272 | " ├───xx2-0xxx2.wav\n", 273 | " ├───...\n", 274 | " └───xxx7-xxx007.wav\n", 275 | "```\n", 276 | "\n", 277 | "**Even if there is only one speaker, a folder named `{speaker_name}` is needed.**" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": null, 283 | "metadata": { 284 | "id": "U05CXlAipvJR" 285 | }, 286 | "outputs": [], 287 | "source": [ 288 | "#@title Get raw dataset from google drive\n", 289 | "#@markdown # Get raw dataset from google drive\n", 290 | "#@markdown\n", 291 | "\n", 292 | "#@markdown Directory 
where **your zip file** located in, dont miss the slash at the end👇.\n", 293 | "sovits_data_dir = \"/content/drive/MyDrive/sovits4data/\" #@param {type:\"string\"}\n", 294 | "#@markdown Filename of **your zip file**, do NOT be \"dataset.zip\"\n", 295 | "zip_filename = \"YOUR_ZIP_NAME.zip\" #@param {type:\"string\"}\n", 296 | "ZIP_PATH = sovits_data_dir + zip_filename\n", 297 | "\n", 298 | "!unzip -od /content/so-vits-svc/dataset_raw {ZIP_PATH}" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": null, 304 | "metadata": { 305 | "colab": { 306 | "base_uri": "https://localhost:8080/" 307 | }, 308 | "executionInfo": { 309 | "elapsed": 67969, 310 | "status": "ok", 311 | "timestamp": 1713780710567, 312 | "user": { 313 | "displayName": "Driver Old", 314 | "userId": "17161741792330272503" 315 | }, 316 | "user_tz": -480 317 | }, 318 | "id": "_ThKTzYs5CfL", 319 | "outputId": "0ee853d3-ba69-43a6-c3f5-ffd6faebcaed" 320 | }, 321 | "outputs": [], 322 | "source": [ 323 | "#@title Resample to 44100Hz and mono\n", 324 | "#@markdown # Resample to 44100Hz and mono\n", 325 | "#@markdown\n", 326 | "\n", 327 | "%cd /content/so-vits-svc\n", 328 | "!python3.8 resample.py" 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": null, 334 | "metadata": { 335 | "colab": { 336 | "base_uri": "https://localhost:8080/" 337 | }, 338 | "executionInfo": { 339 | "elapsed": 85422, 340 | "status": "ok", 341 | "timestamp": 1713780904587, 342 | "user": { 343 | "displayName": "Driver Old", 344 | "userId": "17161741792330272503" 345 | }, 346 | "user_tz": -480 347 | }, 348 | "id": "svITReeL5N8K", 349 | "outputId": "c553067b-0b3c-46ea-eabb-0b7b48b64d1e" 350 | }, 351 | "outputs": [], 352 | "source": [ 353 | "#@title Divide filelists and generate config.json\n", 354 | "#@markdown # Divide filelists and generate config.json\n", 355 | "#@markdown\n", 356 | "\n", 357 | "%cd /content/so-vits-svc\n", 358 | "\n", 359 | "speech_encoder = \"vec768l12\" #@param [\"vec768l12\", \"vec256l9\", \"hubertsoft\", \"whisper-ppg\", \"whisper-ppg-large\"]\n", 360 | "use_vol_aug = False #@param {type:\"boolean\"}\n", 361 | "vol_aug = \"--vol_aug\" if use_vol_aug else \"\"\n", 362 | "\n", 363 | "from pretrain.meta import download_dict\n", 364 | "download_dict = download_dict()\n", 365 | "\n", 366 | "url = download_dict[speech_encoder][\"url\"]\n", 367 | "output = download_dict[speech_encoder][\"output\"]\n", 368 | "\n", 369 | "import os\n", 370 | "if not os.path.exists(output):\n", 371 | " !curl -L {url} -o {output}\n", 372 | " !md5sum {output}\n", 373 | "\n", 374 | "!python3.8 preprocess_flist_config.py --speech_encoder={speech_encoder} {vol_aug}" 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": null, 380 | "metadata": { 381 | "id": "xHUXMi836DMe" 382 | }, 383 | "outputs": [], 384 | "source": [ 385 | "#@title Generate hubert and f0\n", 386 | "#@markdown # Generate hubert and f0\n", 387 | "#@markdown\n", 388 | "%cd /content/so-vits-svc\n", 389 | "\n", 390 | "f0_predictor = \"rmvpe\" #@param [\"crepe\", \"pm\", \"dio\", \"harvest\", \"rmvpe\", \"fcpe\"]\n", 391 | "use_diff = True #@param {type:\"boolean\"}\n", 392 | "\n", 393 | "import os\n", 394 | "if f0_predictor == \"rmvpe\" and not os.path.exists(\"./pretrain/rmvpe.pt\"):\n", 395 | " !curl -L https://huggingface.co/datasets/ylzz1997/rmvpe_pretrain_model/resolve/main/rmvpe.pt -o pretrain/rmvpe.pt\n", 396 | "\n", 397 | "if f0_predictor == \"fcpe\" and not os.path.exists(\"./pretrain/fcpe.pt\"):\n", 398 | " !curl -L 
https://huggingface.co/datasets/ylzz1997/rmvpe_pretrain_model/resolve/main/fcpe.pt -o pretrain/fcpe.pt\n", 399 | "\n", 400 | "\n", 401 | "diff_param = \"\"\n", 402 | "if use_diff:\n", 403 | " diff_param = \"--use_diff\"\n", 404 | "\n", 405 | " if not os.path.exists(\"./pretrain/nsf_hifigan/model\"):\n", 406 | " !curl -L https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip -o nsf_hifigan_20221211.zip\n", 407 | " !md5sum nsf_hifigan_20221211.zip\n", 408 | " !unzip nsf_hifigan_20221211.zip\n", 409 | " !rm -rf pretrain/nsf_hifigan\n", 410 | " !mv -v nsf_hifigan pretrain\n", 411 | "\n", 412 | "!python3.8 preprocess_hubert_f0.py --f0_predictor={f0_predictor} {diff_param}" 413 | ] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": null, 418 | "metadata": { 419 | "id": "Wo4OTmTAUXgj" 420 | }, 421 | "outputs": [], 422 | "source": [ 423 | "#@title Save the preprocessed dataset to google drive\n", 424 | "#@markdown # Save the preprocessed dataset to google drive\n", 425 | "#@markdown\n", 426 | "\n", 427 | "#@markdown You can save the dataset and related files to your google drive for the next training run.\n", 428 | "#@markdown **Directory for saving**, don't miss the slash at the end👇.\n", 429 | "sovits_data_dir = \"/content/drive/MyDrive/sovits4data/\" #@param {type:\"string\"}\n", 430 | "\n", 431 | "#@markdown There will be a `dataset.zip` containing `dataset/` in your google drive, which is the preprocessed data.\n", 432 | "!mkdir -p {sovits_data_dir}\n", 433 | "!zip -r dataset.zip /content/so-vits-svc/dataset\n", 434 | "!cp -vr dataset.zip \"{sovits_data_dir}\"" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": null, 440 | "metadata": { 441 | "id": "P2G6v_6zblWK" 442 | }, 443 | "outputs": [], 444 | "source": [ 445 | "#@title Unzip the preprocessed dataset directly from google drive if you have already preprocessed it.\n", 446 | "#@markdown # Unzip the preprocessed dataset directly from google drive if you have already preprocessed it.\n", 447 | "#@markdown\n", 448 | "\n", 449 | "#@markdown Directory where **your preprocessed dataset** is located, don't miss the slash at the end👇.\n", 450 | "sovits_data_dir = \"/content/drive/MyDrive/sovits4data/\" #@param {type:\"string\"}\n", 451 | "CONFIG = sovits_data_dir + \"configs/\"\n", 452 | "FILELISTS = sovits_data_dir + \"filelists/\"\n", 453 | "DATASET = sovits_data_dir + \"dataset.zip\"\n", 454 | "\n", 455 | "!cp -vr {CONFIG} /content/so-vits-svc/\n", 456 | "!cp -vr {FILELISTS} /content/so-vits-svc/\n", 457 | "!unzip {DATASET} -d /" 458 | ] 459 | }, 460 | { 461 | "cell_type": "markdown", 462 | "metadata": { 463 | "id": "ENoH-pShel7w" 464 | }, 465 | "source": [ 466 | "# **Training**" 467 | ] 468 | }, 469 | { 470 | "cell_type": "code", 471 | "execution_count": null, 472 | "metadata": { 473 | "id": "-hEFFTCfZf57" 474 | }, 475 | "outputs": [], 476 | "source": [ 477 | "#@title Start training\n", 478 | "#@markdown # Start training\n", 479 | "#@markdown If you want to use pre-trained models, upload them to /sovits4data/logs/44k/ in your google drive manually.\n", 480 | "#@markdown\n", 481 | "\n", 482 | "%cd /content/so-vits-svc\n", 483 | "\n", 484 | "#@markdown Whether to enable tensorboard\n", 485 | "tensorboard_on = True #@param {type:\"boolean\"}\n", 486 | "\n", 487 | "if tensorboard_on:\n", 488 | " %load_ext tensorboard\n", 489 | " %tensorboard --logdir logs/44k\n", 490 | "\n", 491 | "config_path = \"configs/config.json\"\n", 492 | "\n", 493 | "from pretrain.meta import get_speech_encoder\n", 
494 | "url, output = get_speech_encoder(config_path)\n", 495 | "\n", 496 | "import os\n", 497 | "if not os.path.exists(output):\n", 498 | " !curl -L {url} -o {output}\n", 499 | "\n", 500 | "!python3.8 train.py -c {config_path} -m 44k" 501 | ] 502 | }, 503 | { 504 | "cell_type": "code", 505 | "execution_count": null, 506 | "metadata": { 507 | "colab": { 508 | "base_uri": "https://localhost:8080/" 509 | }, 510 | "executionInfo": { 511 | "elapsed": 336290, 512 | "status": "ok", 513 | "timestamp": 1713783082098, 514 | "user": { 515 | "displayName": "Driver Old", 516 | "userId": "17161741792330272503" 517 | }, 518 | "user_tz": -480 519 | }, 520 | "id": "ZThaMxmIJgWy", 521 | "outputId": "81d61ece-3971-4c8c-a7db-8423502fc209" 522 | }, 523 | "outputs": [], 524 | "source": [ 525 | "#@title Train cluster model (Optional)\n", 526 | "#@markdown # Train cluster model (Optional)\n", 527 | "#@markdown #### Details see [README.md#cluster-based-timbre-leakage-control](https://github.com/svc-develop-team/so-vits-svc#cluster-based-timbre-leakage-control)\n", 528 | "#@markdown\n", 529 | "\n", 530 | "%cd /content/so-vits-svc\n", 531 | "!python3.8 cluster/train_cluster.py --gpu" 532 | ] 533 | }, 534 | { 535 | "cell_type": "code", 536 | "execution_count": null, 537 | "metadata": { 538 | "colab": { 539 | "base_uri": "https://localhost:8080/" 540 | }, 541 | "executionInfo": { 542 | "elapsed": 348, 543 | "status": "ok", 544 | "timestamp": 1713783300837, 545 | "user": { 546 | "displayName": "Driver Old", 547 | "userId": "17161741792330272503" 548 | }, 549 | "user_tz": -480 550 | }, 551 | "id": "NbTCi7GwHTnZ", 552 | "outputId": "1cbc380b-344c-4dba-dc17-918848be51c0" 553 | }, 554 | "outputs": [], 555 | "source": [ 556 | "#@title Train index model (Optional)\n", 557 | "#@markdown # Train index model (Optional)\n", 558 | "#@markdown #### Details see [README.md#feature-retrieval](https://github.com/svc-develop-team/so-vits-svc#feature-retrieval)\n", 559 | "#@markdown\n", 560 | "\n", 561 | "%cd /content/so-vits-svc\n", 562 | "!python3.8 train_index.py -c configs/config.json" 563 | ] 564 | }, 565 | { 566 | "cell_type": "code", 567 | "execution_count": null, 568 | "metadata": { 569 | "id": "ulSzahztHTnZ" 570 | }, 571 | "outputs": [], 572 | "source": [ 573 | "#@title Train diffusion model (Optional)\n", 574 | "#@markdown # Train diffusion model (Optional)\n", 575 | "#@markdown #### Details see [README.md#-about-shallow-diffusion](https://github.com/svc-develop-team/so-vits-svc#-about-shallow-diffusion)\n", 576 | "#@markdown\n", 577 | "\n", 578 | "%cd /content/so-vits-svc\n", 579 | "\n", 580 | "import os\n", 581 | "if not os.path.exists(\"./pretrain/nsf_hifigan/model\"):\n", 582 | " !curl -L https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip -o nsf_hifigan_20221211.zip\n", 583 | " !unzip nsf_hifigan_20221211.zip\n", 584 | " !rm -rf pretrain/nsf_hifigan\n", 585 | " !mv -v nsf_hifigan pretrain\n", 586 | "\n", 587 | "#@markdown Whether to enable tensorboard\n", 588 | "tensorboard_on = True #@param {type:\"boolean\"}\n", 589 | "\n", 590 | "if tensorboard_on:\n", 591 | " %load_ext tensorboard\n", 592 | " %tensorboard --logdir logs/44k\n", 593 | "\n", 594 | "!python3.8 train_diff.py -c configs/diffusion.yaml" 595 | ] 596 | }, 597 | { 598 | "cell_type": "markdown", 599 | "metadata": { 600 | "id": "lp33AktGHTnZ" 601 | }, 602 | "source": [ 603 | "# keep colab alive\n", 604 | "Open the devtools and copy & paste to run the scrips.\n", 605 | "\n", 606 | "\n", 607 | "```JavaScript\n", 608 | 
"const ping = () => {\n", 609 | " const btn = document.querySelector(\"colab-connect-button\");\n", 610 | " const inner_btn = btn.shadowRoot.querySelector(\"#connect\");\n", 611 | " if (inner_btn) {\n", 612 | " inner_btn.click();\n", 613 | " console.log(\"Clicked on connect button\");\n", 614 | " } else {\n", 615 | " console.log(\"connect button not found\");\n", 616 | " }\n", 617 | " const nextTime = 50000 + Math.random() * 10000;\n", 618 | " setTimeout(ping, nextTime);\n", 619 | "};\n", 620 | "ping();\n", 621 | "```" 622 | ] 623 | }, 624 | { 625 | "cell_type": "markdown", 626 | "metadata": { 627 | "id": "oCnbX-OT897k" 628 | }, 629 | "source": [ 630 | "# **Inference**\n", 631 | "\n", 632 | "- Upload your raw audio files to `/content/so-vits-svc/raw`\n", 633 | "- Download result audio files from `/content/so-vits-svc/results`\n", 634 | "- Put your sovits model in `/content/so-vits-svc/logs/44k/YOUR_MODEL.pth`\n", 635 | "- Put yuor diffusion model in `/content/so-vits-svc/logs/44k/diffusion/YOUR_MODEL.pt`\n", 636 | "- If you need to use kmeans, put `kmeans_10000.pt` in `/content/so-vits-svc/logs/44k/kmeans_10000.pt`\n", 637 | "- Put your sovits config file `config.json` and diffusion config file (if you use diffusion model)`diffusion.yaml` in `/content/so-vits-svc/configs`" 638 | ] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "execution_count": null, 643 | "metadata": { 644 | "colab": { 645 | "base_uri": "https://localhost:8080/" 646 | }, 647 | "id": "dYnKuKTIj3z1", 648 | "outputId": "1d5e9cf7-7a15-4554-b8f9-53a18c3c785f" 649 | }, 650 | "outputs": [], 651 | "source": [ 652 | "#@title Start inference (and download)\n", 653 | "#@markdown # Start inference (and download)\n", 654 | "#@markdown Parameters see [README.MD#Inference](https://github.com/svc-develop-team/so-vits-svc#-inference) \\\n", 655 | "#@markdown\n", 656 | "\n", 657 | "#@markdown\n", 658 | "\n", 659 | "#@markdown File and model Parameters\n", 660 | "wav_filename = \"YOUR_AUDIO_NAME.wav\" #@param {type:\"string\"}\n", 661 | "model_filename = \"YOUR_MODEL_NAME.pth\" #@param {type:\"string\"}\n", 662 | "config_filename = \"config.json\" #@param {type:\"string\"}\n", 663 | "speaker = \"YOUR_SPEARER_NAME\" #@param {type:\"string\"}\n", 664 | "use_diffusion_model = True #@param {type:\"boolean\"}\n", 665 | "diffusion_model_name = \"YOUR_DIFFUSION_MODEL_NAME.pt\" #@param {type:\"string\"}\n", 666 | "diffusion_config_filename = \"diffusion.yaml\" #@param {type:\"string\"}\n", 667 | "use_kmeans_model = False #@param {type:\"boolean\"}\n", 668 | "kmeans_filenname = \"kmeans_10000.pt\" #@param {type:\"string\"}\n", 669 | "\n", 670 | "#@markdown\n", 671 | "\n", 672 | "#@markdown Common Parameters\n", 673 | "trans = \"0\" #@param {type:\"string\"}\n", 674 | "force_clip_seconds = \"0\" #@param {type:\"string\"}\n", 675 | "f0_predictor = \"rmvpe\" #@param [\"crepe\", \"pm\", \"dio\", \"harvest\", \"rmvpe\", \"fcpe\"]\n", 676 | "auto_predict_f0 = False #@param {type:\"boolean\"}\n", 677 | "enhance = False #@param {type:\"boolean\"}\n", 678 | "\n", 679 | "#@markdown\n", 680 | "\n", 681 | "#@markdown Diffusion Parameters\n", 682 | "diffusion_k_step = \"20\" #@param {type:\"string\"}\n", 683 | "second_encoding = False #@param {type:\"boolean\"}\n", 684 | "only_diffusion = False #@param {type:\"boolean\"}\n", 685 | "\n", 686 | "#@markdown\n", 687 | "\n", 688 | "#@markdown Other Parameters\n", 689 | "cluster_infer_ratio = \"0\" #@param {type:\"string\"}\n", 690 | "slice_db = \"-40\" #@param {type:\"string\"}\n", 691 | "wav_format = \"wav\" 
#@param [\"wav\", \"flac\", \"mp3\"]\n", 692 | "\n", 693 | "\n", 694 | "model_path = \"/content/so-vits-svc/logs/44k/\" + model_filename\n", 695 | "config_path = \"/content/so-vits-svc/configs/\" + config_filename\n", 696 | "diffusion_model_path = \"/content/so-vits-svc/logs/44k/diffusion/\" + diffusion_model_name\n", 697 | "diffusion_config_path = \"/content/so-vits-svc/configs/\" + diffusion_config_filename\n", 698 | "kmeans_path = \"/content/so-vits-svc/logs/44k/\" + kmeans_filenname\n", 699 | "\n", 700 | "common_param = f\" --trans {trans} --clip {force_clip_seconds} --f0_predictor {f0_predictor}\"\n", 701 | "if auto_predict_f0:\n", 702 | " common_param += \" --auto_predict_f0\"\n", 703 | "if enhance:\n", 704 | " common_param += \" --enhance\"\n", 705 | "\n", 706 | "diffusion_param = \"\"\n", 707 | "if use_diffusion_model:\n", 708 | " diffusion_param = \" --shallow_diffusion\"\n", 709 | " diffusion_param += f\" --diffusion_model_path \\\"{diffusion_model_path}\\\"\"\n", 710 | " diffusion_param += f\" --diffusion_config_path \\\"{diffusion_config_path}\\\"\"\n", 711 | " diffusion_param += f\" --k_step {diffusion_k_step}\"\n", 712 | " if second_encoding:\n", 713 | " diffusion_param += \" --second_encoding\"\n", 714 | " if only_diffusion:\n", 715 | " diffusion_param += \" --only_diffusion\"\n", 716 | "\n", 717 | "kmeans_param = \"\"\n", 718 | "if use_kmeans_model:\n", 719 | " kmeans_param = f\" --cluster_model_path \\\"{kmeans_path}\\\"\"\n", 720 | " kmeans_param += f\" --cluster_infer_ratio {cluster_infer_ratio}\"\n", 721 | "\n", 722 | "basic_param = f\"-n \\\"{wav_filename}\\\" -m \\\"{model_path}\\\" -s {speaker} -c \\\"{config_path}\\\"\"\n", 723 | "other_param = f\" --slice_db {slice_db} --wav_format {wav_format}\"\n", 724 | "param = basic_param + common_param + diffusion_param + kmeans_param + other_param\n", 725 | "\n", 726 | "\n", 727 | "from pretrain.meta import get_speech_encoder\n", 728 | "url, output = get_speech_encoder(config_path)\n", 729 | "import os\n", 730 | "\n", 731 | "if not os.path.exists(\"./pretrain/nsf_hifigan/model\"):\n", 732 | " !curl -L https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip -o /content/so-vits-svc/nsf_hifigan_20221211.zip\n", 733 | " !unzip nsf_hifigan_20221211.zip\n", 734 | " !rm -rf pretrain/nsf_hifigan\n", 735 | " !mv -v nsf_hifigan pretrain\n", 736 | "\n", 737 | "if f0_predictor == \"rmvpe\" and not os.path.exists(\"./pretrain/rmvpe.pt\"):\n", 738 | " !curl -L https://huggingface.co/datasets/ylzz1997/rmvpe_pretrain_model/resolve/main/rmvpe.pt -o pretrain/rmvpe.pt\n", 739 | "if f0_predictor == \"fcpe\" and not os.path.exists(\"./pretrain/fcpe.pt\"):\n", 740 | " !curl -L https://huggingface.co/datasets/ylzz1997/rmvpe_pretrain_model/resolve/main/fcpe.pt -o pretrain/fcpe.pt\n", 741 | "if not os.path.exists(output):\n", 742 | " !curl -L {url} -o {output}\n", 743 | "\n", 744 | "\n", 745 | "%cd /content/so-vits-svc\n", 746 | "print(f\"python3.8 inference_main.py {param}\")\n", 747 | "!python3.8 inference_main.py {param}" 748 | ] 749 | } 750 | ], 751 | "metadata": { 752 | "colab": { 753 | "provenance": [ 754 | { 755 | "file_id": "19fxpo-ZoL_ShEUeZIZi6Di-YioWrEyhR", 756 | "timestamp": 1678516497580 757 | }, 758 | { 759 | "file_id": "1rCUOOVG7-XQlVZuWRAj5IpGrMM8t07pE", 760 | "timestamp": 1673086970071 761 | }, 762 | { 763 | "file_id": "1Ul5SmzWiSHBj0MaKA0B682C-RZKOycwF", 764 | "timestamp": 1670483515921 765 | } 766 | ] 767 | }, 768 | "kernelspec": { 769 | "display_name": "Python 3", 770 | "name": "python3" 771 
| }, 772 | "language_info": { 773 | "codemirror_mode": { 774 | "name": "ipython", 775 | "version": 3 776 | }, 777 | "file_extension": ".py", 778 | "mimetype": "text/x-python", 779 | "name": "python", 780 | "nbconvert_exporter": "python", 781 | "pygments_lexer": "ipython3", 782 | "version": "3.8.16" 783 | } 784 | }, 785 | "nbformat": 4, 786 | "nbformat_minor": 0 787 | } 788 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | GNU AFFERO GENERAL PUBLIC LICENSE 2 | Version 3, 19 November 2007 3 | 4 | Copyright (C) 2007 Free Software Foundation, Inc. 5 | Everyone is permitted to copy and distribute verbatim copies 6 | of this license document, but changing it is not allowed. 7 | 8 | Preamble 9 | 10 | The GNU Affero General Public License is a free, copyleft license for 11 | software and other kinds of works, specifically designed to ensure 12 | cooperation with the community in the case of network server software. 13 | 14 | The licenses for most software and other practical works are designed 15 | to take away your freedom to share and change the works. By contrast, 16 | our General Public Licenses are intended to guarantee your freedom to 17 | share and change all versions of a program--to make sure it remains free 18 | software for all its users. 19 | 20 | When we speak of free software, we are referring to freedom, not 21 | price. Our General Public Licenses are designed to make sure that you 22 | have the freedom to distribute copies of free software (and charge for 23 | them if you wish), that you receive source code or can get it if you 24 | want it, that you can change the software or use pieces of it in new 25 | free programs, and that you know you can do these things. 26 | 27 | Developers that use our General Public Licenses protect your rights 28 | with two steps: (1) assert copyright on the software, and (2) offer 29 | you this License which gives you legal permission to copy, distribute 30 | and/or modify the software. 31 | 32 | A secondary benefit of defending all users' freedom is that 33 | improvements made in alternate versions of the program, if they 34 | receive widespread use, become available for other developers to 35 | incorporate. Many developers of free software are heartened and 36 | encouraged by the resulting cooperation. However, in the case of 37 | software used on network servers, this result may fail to come about. 38 | The GNU General Public License permits making a modified version and 39 | letting the public access it on a server without ever releasing its 40 | source code to the public. 41 | 42 | The GNU Affero General Public License is designed specifically to 43 | ensure that, in such cases, the modified source code becomes available 44 | to the community. It requires the operator of a network server to 45 | provide the source code of the modified version running there to the 46 | users of that server. Therefore, public use of a modified version, on 47 | a publicly accessible server, gives the public access to the source 48 | code of the modified version. 49 | 50 | An older license, called the Affero General Public License and 51 | published by Affero, was designed to accomplish similar goals. This is 52 | a different license, not a version of the Affero GPL, but Affero has 53 | released a new version of the Affero GPL which permits relicensing under 54 | this license. 
55 | 56 | The precise terms and conditions for copying, distribution and 57 | modification follow. 58 | 59 | TERMS AND CONDITIONS 60 | 61 | 0. Definitions. 62 | 63 | "This License" refers to version 3 of the GNU Affero General Public License. 64 | 65 | "Copyright" also means copyright-like laws that apply to other kinds of 66 | works, such as semiconductor masks. 67 | 68 | "The Program" refers to any copyrightable work licensed under this 69 | License. Each licensee is addressed as "you". "Licensees" and 70 | "recipients" may be individuals or organizations. 71 | 72 | To "modify" a work means to copy from or adapt all or part of the work 73 | in a fashion requiring copyright permission, other than the making of an 74 | exact copy. The resulting work is called a "modified version" of the 75 | earlier work or a work "based on" the earlier work. 76 | 77 | A "covered work" means either the unmodified Program or a work based 78 | on the Program. 79 | 80 | To "propagate" a work means to do anything with it that, without 81 | permission, would make you directly or secondarily liable for 82 | infringement under applicable copyright law, except executing it on a 83 | computer or modifying a private copy. Propagation includes copying, 84 | distribution (with or without modification), making available to the 85 | public, and in some countries other activities as well. 86 | 87 | To "convey" a work means any kind of propagation that enables other 88 | parties to make or receive copies. Mere interaction with a user through 89 | a computer network, with no transfer of a copy, is not conveying. 90 | 91 | An interactive user interface displays "Appropriate Legal Notices" 92 | to the extent that it includes a convenient and prominently visible 93 | feature that (1) displays an appropriate copyright notice, and (2) 94 | tells the user that there is no warranty for the work (except to the 95 | extent that warranties are provided), that licensees may convey the 96 | work under this License, and how to view a copy of this License. If 97 | the interface presents a list of user commands or options, such as a 98 | menu, a prominent item in the list meets this criterion. 99 | 100 | 1. Source Code. 101 | 102 | The "source code" for a work means the preferred form of the work 103 | for making modifications to it. "Object code" means any non-source 104 | form of a work. 105 | 106 | A "Standard Interface" means an interface that either is an official 107 | standard defined by a recognized standards body, or, in the case of 108 | interfaces specified for a particular programming language, one that 109 | is widely used among developers working in that language. 110 | 111 | The "System Libraries" of an executable work include anything, other 112 | than the work as a whole, that (a) is included in the normal form of 113 | packaging a Major Component, but which is not part of that Major 114 | Component, and (b) serves only to enable use of the work with that 115 | Major Component, or to implement a Standard Interface for which an 116 | implementation is available to the public in source code form. A 117 | "Major Component", in this context, means a major essential component 118 | (kernel, window system, and so on) of the specific operating system 119 | (if any) on which the executable work runs, or a compiler used to 120 | produce the work, or an object code interpreter used to run it. 
121 | 122 | The "Corresponding Source" for a work in object code form means all 123 | the source code needed to generate, install, and (for an executable 124 | work) run the object code and to modify the work, including scripts to 125 | control those activities. However, it does not include the work's 126 | System Libraries, or general-purpose tools or generally available free 127 | programs which are used unmodified in performing those activities but 128 | which are not part of the work. For example, Corresponding Source 129 | includes interface definition files associated with source files for 130 | the work, and the source code for shared libraries and dynamically 131 | linked subprograms that the work is specifically designed to require, 132 | such as by intimate data communication or control flow between those 133 | subprograms and other parts of the work. 134 | 135 | The Corresponding Source need not include anything that users 136 | can regenerate automatically from other parts of the Corresponding 137 | Source. 138 | 139 | The Corresponding Source for a work in source code form is that 140 | same work. 141 | 142 | 2. Basic Permissions. 143 | 144 | All rights granted under this License are granted for the term of 145 | copyright on the Program, and are irrevocable provided the stated 146 | conditions are met. This License explicitly affirms your unlimited 147 | permission to run the unmodified Program. The output from running a 148 | covered work is covered by this License only if the output, given its 149 | content, constitutes a covered work. This License acknowledges your 150 | rights of fair use or other equivalent, as provided by copyright law. 151 | 152 | You may make, run and propagate covered works that you do not 153 | convey, without conditions so long as your license otherwise remains 154 | in force. You may convey covered works to others for the sole purpose 155 | of having them make modifications exclusively for you, or provide you 156 | with facilities for running those works, provided that you comply with 157 | the terms of this License in conveying all material for which you do 158 | not control copyright. Those thus making or running the covered works 159 | for you must do so exclusively on your behalf, under your direction 160 | and control, on terms that prohibit them from making any copies of 161 | your copyrighted material outside their relationship with you. 162 | 163 | Conveying under any other circumstances is permitted solely under 164 | the conditions stated below. Sublicensing is not allowed; section 10 165 | makes it unnecessary. 166 | 167 | 3. Protecting Users' Legal Rights From Anti-Circumvention Law. 168 | 169 | No covered work shall be deemed part of an effective technological 170 | measure under any applicable law fulfilling obligations under article 171 | 11 of the WIPO copyright treaty adopted on 20 December 1996, or 172 | similar laws prohibiting or restricting circumvention of such 173 | measures. 174 | 175 | When you convey a covered work, you waive any legal power to forbid 176 | circumvention of technological measures to the extent such circumvention 177 | is effected by exercising rights under this License with respect to 178 | the covered work, and you disclaim any intention to limit operation or 179 | modification of the work as a means of enforcing, against the work's 180 | users, your or third parties' legal rights to forbid circumvention of 181 | technological measures. 182 | 183 | 4. Conveying Verbatim Copies. 
184 | 185 | You may convey verbatim copies of the Program's source code as you 186 | receive it, in any medium, provided that you conspicuously and 187 | appropriately publish on each copy an appropriate copyright notice; 188 | keep intact all notices stating that this License and any 189 | non-permissive terms added in accord with section 7 apply to the code; 190 | keep intact all notices of the absence of any warranty; and give all 191 | recipients a copy of this License along with the Program. 192 | 193 | You may charge any price or no price for each copy that you convey, 194 | and you may offer support or warranty protection for a fee. 195 | 196 | 5. Conveying Modified Source Versions. 197 | 198 | You may convey a work based on the Program, or the modifications to 199 | produce it from the Program, in the form of source code under the 200 | terms of section 4, provided that you also meet all of these conditions: 201 | 202 | a) The work must carry prominent notices stating that you modified 203 | it, and giving a relevant date. 204 | 205 | b) The work must carry prominent notices stating that it is 206 | released under this License and any conditions added under section 207 | 7. This requirement modifies the requirement in section 4 to 208 | "keep intact all notices". 209 | 210 | c) You must license the entire work, as a whole, under this 211 | License to anyone who comes into possession of a copy. This 212 | License will therefore apply, along with any applicable section 7 213 | additional terms, to the whole of the work, and all its parts, 214 | regardless of how they are packaged. This License gives no 215 | permission to license the work in any other way, but it does not 216 | invalidate such permission if you have separately received it. 217 | 218 | d) If the work has interactive user interfaces, each must display 219 | Appropriate Legal Notices; however, if the Program has interactive 220 | interfaces that do not display Appropriate Legal Notices, your 221 | work need not make them do so. 222 | 223 | A compilation of a covered work with other separate and independent 224 | works, which are not by their nature extensions of the covered work, 225 | and which are not combined with it such as to form a larger program, 226 | in or on a volume of a storage or distribution medium, is called an 227 | "aggregate" if the compilation and its resulting copyright are not 228 | used to limit the access or legal rights of the compilation's users 229 | beyond what the individual works permit. Inclusion of a covered work 230 | in an aggregate does not cause this License to apply to the other 231 | parts of the aggregate. 232 | 233 | 6. Conveying Non-Source Forms. 234 | 235 | You may convey a covered work in object code form under the terms 236 | of sections 4 and 5, provided that you also convey the 237 | machine-readable Corresponding Source under the terms of this License, 238 | in one of these ways: 239 | 240 | a) Convey the object code in, or embodied in, a physical product 241 | (including a physical distribution medium), accompanied by the 242 | Corresponding Source fixed on a durable physical medium 243 | customarily used for software interchange. 
244 | 245 | b) Convey the object code in, or embodied in, a physical product 246 | (including a physical distribution medium), accompanied by a 247 | written offer, valid for at least three years and valid for as 248 | long as you offer spare parts or customer support for that product 249 | model, to give anyone who possesses the object code either (1) a 250 | copy of the Corresponding Source for all the software in the 251 | product that is covered by this License, on a durable physical 252 | medium customarily used for software interchange, for a price no 253 | more than your reasonable cost of physically performing this 254 | conveying of source, or (2) access to copy the 255 | Corresponding Source from a network server at no charge. 256 | 257 | c) Convey individual copies of the object code with a copy of the 258 | written offer to provide the Corresponding Source. This 259 | alternative is allowed only occasionally and noncommercially, and 260 | only if you received the object code with such an offer, in accord 261 | with subsection 6b. 262 | 263 | d) Convey the object code by offering access from a designated 264 | place (gratis or for a charge), and offer equivalent access to the 265 | Corresponding Source in the same way through the same place at no 266 | further charge. You need not require recipients to copy the 267 | Corresponding Source along with the object code. If the place to 268 | copy the object code is a network server, the Corresponding Source 269 | may be on a different server (operated by you or a third party) 270 | that supports equivalent copying facilities, provided you maintain 271 | clear directions next to the object code saying where to find the 272 | Corresponding Source. Regardless of what server hosts the 273 | Corresponding Source, you remain obligated to ensure that it is 274 | available for as long as needed to satisfy these requirements. 275 | 276 | e) Convey the object code using peer-to-peer transmission, provided 277 | you inform other peers where the object code and Corresponding 278 | Source of the work are being offered to the general public at no 279 | charge under subsection 6d. 280 | 281 | A separable portion of the object code, whose source code is excluded 282 | from the Corresponding Source as a System Library, need not be 283 | included in conveying the object code work. 284 | 285 | A "User Product" is either (1) a "consumer product", which means any 286 | tangible personal property which is normally used for personal, family, 287 | or household purposes, or (2) anything designed or sold for incorporation 288 | into a dwelling. In determining whether a product is a consumer product, 289 | doubtful cases shall be resolved in favor of coverage. For a particular 290 | product received by a particular user, "normally used" refers to a 291 | typical or common use of that class of product, regardless of the status 292 | of the particular user or of the way in which the particular user 293 | actually uses, or expects or is expected to use, the product. A product 294 | is a consumer product regardless of whether the product has substantial 295 | commercial, industrial or non-consumer uses, unless such uses represent 296 | the only significant mode of use of the product. 
297 | 298 | "Installation Information" for a User Product means any methods, 299 | procedures, authorization keys, or other information required to install 300 | and execute modified versions of a covered work in that User Product from 301 | a modified version of its Corresponding Source. The information must 302 | suffice to ensure that the continued functioning of the modified object 303 | code is in no case prevented or interfered with solely because 304 | modification has been made. 305 | 306 | If you convey an object code work under this section in, or with, or 307 | specifically for use in, a User Product, and the conveying occurs as 308 | part of a transaction in which the right of possession and use of the 309 | User Product is transferred to the recipient in perpetuity or for a 310 | fixed term (regardless of how the transaction is characterized), the 311 | Corresponding Source conveyed under this section must be accompanied 312 | by the Installation Information. But this requirement does not apply 313 | if neither you nor any third party retains the ability to install 314 | modified object code on the User Product (for example, the work has 315 | been installed in ROM). 316 | 317 | The requirement to provide Installation Information does not include a 318 | requirement to continue to provide support service, warranty, or updates 319 | for a work that has been modified or installed by the recipient, or for 320 | the User Product in which it has been modified or installed. Access to a 321 | network may be denied when the modification itself materially and 322 | adversely affects the operation of the network or violates the rules and 323 | protocols for communication across the network. 324 | 325 | Corresponding Source conveyed, and Installation Information provided, 326 | in accord with this section must be in a format that is publicly 327 | documented (and with an implementation available to the public in 328 | source code form), and must require no special password or key for 329 | unpacking, reading or copying. 330 | 331 | 7. Additional Terms. 332 | 333 | "Additional permissions" are terms that supplement the terms of this 334 | License by making exceptions from one or more of its conditions. 335 | Additional permissions that are applicable to the entire Program shall 336 | be treated as though they were included in this License, to the extent 337 | that they are valid under applicable law. If additional permissions 338 | apply only to part of the Program, that part may be used separately 339 | under those permissions, but the entire Program remains governed by 340 | this License without regard to the additional permissions. 341 | 342 | When you convey a copy of a covered work, you may at your option 343 | remove any additional permissions from that copy, or from any part of 344 | it. (Additional permissions may be written to require their own 345 | removal in certain cases when you modify the work.) You may place 346 | additional permissions on material, added by you to a covered work, 347 | for which you have or can give appropriate copyright permission. 
348 | 349 | Notwithstanding any other provision of this License, for material you 350 | add to a covered work, you may (if authorized by the copyright holders of 351 | that material) supplement the terms of this License with terms: 352 | 353 | a) Disclaiming warranty or limiting liability differently from the 354 | terms of sections 15 and 16 of this License; or 355 | 356 | b) Requiring preservation of specified reasonable legal notices or 357 | author attributions in that material or in the Appropriate Legal 358 | Notices displayed by works containing it; or 359 | 360 | c) Prohibiting misrepresentation of the origin of that material, or 361 | requiring that modified versions of such material be marked in 362 | reasonable ways as different from the original version; or 363 | 364 | d) Limiting the use for publicity purposes of names of licensors or 365 | authors of the material; or 366 | 367 | e) Declining to grant rights under trademark law for use of some 368 | trade names, trademarks, or service marks; or 369 | 370 | f) Requiring indemnification of licensors and authors of that 371 | material by anyone who conveys the material (or modified versions of 372 | it) with contractual assumptions of liability to the recipient, for 373 | any liability that these contractual assumptions directly impose on 374 | those licensors and authors. 375 | 376 | All other non-permissive additional terms are considered "further 377 | restrictions" within the meaning of section 10. If the Program as you 378 | received it, or any part of it, contains a notice stating that it is 379 | governed by this License along with a term that is a further 380 | restriction, you may remove that term. If a license document contains 381 | a further restriction but permits relicensing or conveying under this 382 | License, you may add to a covered work material governed by the terms 383 | of that license document, provided that the further restriction does 384 | not survive such relicensing or conveying. 385 | 386 | If you add terms to a covered work in accord with this section, you 387 | must place, in the relevant source files, a statement of the 388 | additional terms that apply to those files, or a notice indicating 389 | where to find the applicable terms. 390 | 391 | Additional terms, permissive or non-permissive, may be stated in the 392 | form of a separately written license, or stated as exceptions; 393 | the above requirements apply either way. 394 | 395 | 8. Termination. 396 | 397 | You may not propagate or modify a covered work except as expressly 398 | provided under this License. Any attempt otherwise to propagate or 399 | modify it is void, and will automatically terminate your rights under 400 | this License (including any patent licenses granted under the third 401 | paragraph of section 11). 402 | 403 | However, if you cease all violation of this License, then your 404 | license from a particular copyright holder is reinstated (a) 405 | provisionally, unless and until the copyright holder explicitly and 406 | finally terminates your license, and (b) permanently, if the copyright 407 | holder fails to notify you of the violation by some reasonable means 408 | prior to 60 days after the cessation. 
409 | 410 | Moreover, your license from a particular copyright holder is 411 | reinstated permanently if the copyright holder notifies you of the 412 | violation by some reasonable means, this is the first time you have 413 | received notice of violation of this License (for any work) from that 414 | copyright holder, and you cure the violation prior to 30 days after 415 | your receipt of the notice. 416 | 417 | Termination of your rights under this section does not terminate the 418 | licenses of parties who have received copies or rights from you under 419 | this License. If your rights have been terminated and not permanently 420 | reinstated, you do not qualify to receive new licenses for the same 421 | material under section 10. 422 | 423 | 9. Acceptance Not Required for Having Copies. 424 | 425 | You are not required to accept this License in order to receive or 426 | run a copy of the Program. Ancillary propagation of a covered work 427 | occurring solely as a consequence of using peer-to-peer transmission 428 | to receive a copy likewise does not require acceptance. However, 429 | nothing other than this License grants you permission to propagate or 430 | modify any covered work. These actions infringe copyright if you do 431 | not accept this License. Therefore, by modifying or propagating a 432 | covered work, you indicate your acceptance of this License to do so. 433 | 434 | 10. Automatic Licensing of Downstream Recipients. 435 | 436 | Each time you convey a covered work, the recipient automatically 437 | receives a license from the original licensors, to run, modify and 438 | propagate that work, subject to this License. You are not responsible 439 | for enforcing compliance by third parties with this License. 440 | 441 | An "entity transaction" is a transaction transferring control of an 442 | organization, or substantially all assets of one, or subdividing an 443 | organization, or merging organizations. If propagation of a covered 444 | work results from an entity transaction, each party to that 445 | transaction who receives a copy of the work also receives whatever 446 | licenses to the work the party's predecessor in interest had or could 447 | give under the previous paragraph, plus a right to possession of the 448 | Corresponding Source of the work from the predecessor in interest, if 449 | the predecessor has it or can get it with reasonable efforts. 450 | 451 | You may not impose any further restrictions on the exercise of the 452 | rights granted or affirmed under this License. For example, you may 453 | not impose a license fee, royalty, or other charge for exercise of 454 | rights granted under this License, and you may not initiate litigation 455 | (including a cross-claim or counterclaim in a lawsuit) alleging that 456 | any patent claim is infringed by making, using, selling, offering for 457 | sale, or importing the Program or any portion of it. 458 | 459 | 11. Patents. 460 | 461 | A "contributor" is a copyright holder who authorizes use under this 462 | License of the Program or a work on which the Program is based. The 463 | work thus licensed is called the contributor's "contributor version". 
464 | 465 | A contributor's "essential patent claims" are all patent claims 466 | owned or controlled by the contributor, whether already acquired or 467 | hereafter acquired, that would be infringed by some manner, permitted 468 | by this License, of making, using, or selling its contributor version, 469 | but do not include claims that would be infringed only as a 470 | consequence of further modification of the contributor version. For 471 | purposes of this definition, "control" includes the right to grant 472 | patent sublicenses in a manner consistent with the requirements of 473 | this License. 474 | 475 | Each contributor grants you a non-exclusive, worldwide, royalty-free 476 | patent license under the contributor's essential patent claims, to 477 | make, use, sell, offer for sale, import and otherwise run, modify and 478 | propagate the contents of its contributor version. 479 | 480 | In the following three paragraphs, a "patent license" is any express 481 | agreement or commitment, however denominated, not to enforce a patent 482 | (such as an express permission to practice a patent or covenant not to 483 | sue for patent infringement). To "grant" such a patent license to a 484 | party means to make such an agreement or commitment not to enforce a 485 | patent against the party. 486 | 487 | If you convey a covered work, knowingly relying on a patent license, 488 | and the Corresponding Source of the work is not available for anyone 489 | to copy, free of charge and under the terms of this License, through a 490 | publicly available network server or other readily accessible means, 491 | then you must either (1) cause the Corresponding Source to be so 492 | available, or (2) arrange to deprive yourself of the benefit of the 493 | patent license for this particular work, or (3) arrange, in a manner 494 | consistent with the requirements of this License, to extend the patent 495 | license to downstream recipients. "Knowingly relying" means you have 496 | actual knowledge that, but for the patent license, your conveying the 497 | covered work in a country, or your recipient's use of the covered work 498 | in a country, would infringe one or more identifiable patents in that 499 | country that you have reason to believe are valid. 500 | 501 | If, pursuant to or in connection with a single transaction or 502 | arrangement, you convey, or propagate by procuring conveyance of, a 503 | covered work, and grant a patent license to some of the parties 504 | receiving the covered work authorizing them to use, propagate, modify 505 | or convey a specific copy of the covered work, then the patent license 506 | you grant is automatically extended to all recipients of the covered 507 | work and works based on it. 508 | 509 | A patent license is "discriminatory" if it does not include within 510 | the scope of its coverage, prohibits the exercise of, or is 511 | conditioned on the non-exercise of one or more of the rights that are 512 | specifically granted under this License. 
You may not convey a covered 513 | work if you are a party to an arrangement with a third party that is 514 | in the business of distributing software, under which you make payment 515 | to the third party based on the extent of your activity of conveying 516 | the work, and under which the third party grants, to any of the 517 | parties who would receive the covered work from you, a discriminatory 518 | patent license (a) in connection with copies of the covered work 519 | conveyed by you (or copies made from those copies), or (b) primarily 520 | for and in connection with specific products or compilations that 521 | contain the covered work, unless you entered into that arrangement, 522 | or that patent license was granted, prior to 28 March 2007. 523 | 524 | Nothing in this License shall be construed as excluding or limiting 525 | any implied license or other defenses to infringement that may 526 | otherwise be available to you under applicable patent law. 527 | 528 | 12. No Surrender of Others' Freedom. 529 | 530 | If conditions are imposed on you (whether by court order, agreement or 531 | otherwise) that contradict the conditions of this License, they do not 532 | excuse you from the conditions of this License. If you cannot convey a 533 | covered work so as to satisfy simultaneously your obligations under this 534 | License and any other pertinent obligations, then as a consequence you may 535 | not convey it at all. For example, if you agree to terms that obligate you 536 | to collect a royalty for further conveying from those to whom you convey 537 | the Program, the only way you could satisfy both those terms and this 538 | License would be to refrain entirely from conveying the Program. 539 | 540 | 13. Remote Network Interaction; Use with the GNU General Public License. 541 | 542 | Notwithstanding any other provision of this License, if you modify the 543 | Program, your modified version must prominently offer all users 544 | interacting with it remotely through a computer network (if your version 545 | supports such interaction) an opportunity to receive the Corresponding 546 | Source of your version by providing access to the Corresponding Source 547 | from a network server at no charge, through some standard or customary 548 | means of facilitating copying of software. This Corresponding Source 549 | shall include the Corresponding Source for any work covered by version 3 550 | of the GNU General Public License that is incorporated pursuant to the 551 | following paragraph. 552 | 553 | Notwithstanding any other provision of this License, you have 554 | permission to link or combine any covered work with a work licensed 555 | under version 3 of the GNU General Public License into a single 556 | combined work, and to convey the resulting work. The terms of this 557 | License will continue to apply to the part which is the covered work, 558 | but the work with which it is combined will remain governed by version 559 | 3 of the GNU General Public License. 560 | 561 | 14. Revised Versions of this License. 562 | 563 | The Free Software Foundation may publish revised and/or new versions of 564 | the GNU Affero General Public License from time to time. Such new versions 565 | will be similar in spirit to the present version, but may differ in detail to 566 | address new problems or concerns. 567 | 568 | Each version is given a distinguishing version number. 
If the 569 | Program specifies that a certain numbered version of the GNU Affero General 570 | Public License "or any later version" applies to it, you have the 571 | option of following the terms and conditions either of that numbered 572 | version or of any later version published by the Free Software 573 | Foundation. If the Program does not specify a version number of the 574 | GNU Affero General Public License, you may choose any version ever published 575 | by the Free Software Foundation. 576 | 577 | If the Program specifies that a proxy can decide which future 578 | versions of the GNU Affero General Public License can be used, that proxy's 579 | public statement of acceptance of a version permanently authorizes you 580 | to choose that version for the Program. 581 | 582 | Later license versions may give you additional or different 583 | permissions. However, no additional obligations are imposed on any 584 | author or copyright holder as a result of your choosing to follow a 585 | later version. 586 | 587 | 15. Disclaimer of Warranty. 588 | 589 | THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY 590 | APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT 591 | HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY 592 | OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, 593 | THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 594 | PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM 595 | IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF 596 | ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 597 | 598 | 16. Limitation of Liability. 599 | 600 | IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING 601 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS 602 | THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY 603 | GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE 604 | USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF 605 | DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD 606 | PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), 607 | EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF 608 | SUCH DAMAGES. 609 | 610 | 17. Interpretation of Sections 15 and 16. 611 | 612 | If the disclaimer of warranty and limitation of liability provided 613 | above cannot be given local legal effect according to their terms, 614 | reviewing courts shall apply local law that most closely approximates 615 | an absolute waiver of all civil liability in connection with the 616 | Program, unless a warranty or assumption of liability accompanies a 617 | copy of the Program in return for a fee. 618 | 619 | END OF TERMS AND CONDITIONS 620 | 621 | How to Apply These Terms to Your New Programs 622 | 623 | If you develop a new program, and you want it to be of the greatest 624 | possible use to the public, the best way to achieve this is to make it 625 | free software which everyone can redistribute and change under these terms. 626 | 627 | To do so, attach the following notices to the program. It is safest 628 | to attach them to the start of each source file to most effectively 629 | state the exclusion of warranty; and each file should have at least 630 | the "copyright" line and a pointer to where the full notice is found. 
631 | 632 | 633 | Copyright (C) 634 | 635 | This program is free software: you can redistribute it and/or modify 636 | it under the terms of the GNU Affero General Public License as published 637 | by the Free Software Foundation, either version 3 of the License, or 638 | (at your option) any later version. 639 | 640 | This program is distributed in the hope that it will be useful, 641 | but WITHOUT ANY WARRANTY; without even the implied warranty of 642 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 643 | GNU Affero General Public License for more details. 644 | 645 | You should have received a copy of the GNU Affero General Public License 646 | along with this program. If not, see . 647 | 648 | Also add information on how to contact you by electronic and paper mail. 649 | 650 | If your software can interact with users remotely through a computer 651 | network, you should also make sure that it provides a way for users to 652 | get its source. For example, if your program is a web application, its 653 | interface could display a "Source" link that leads users to an archive 654 | of the code. There are many ways you could offer source, and different 655 | solutions will be better for different programs; see section 13 for the 656 | specific requirements. 657 | 658 | You should also get your employer (if you work as a programmer) or school, 659 | if any, to sign a "copyright disclaimer" for the program, if necessary. 660 | For more information on this, and how to apply and follow the GNU AGPL, see 661 | . 662 | -------------------------------------------------------------------------------- /README_zh_CN.md: -------------------------------------------------------------------------------- 1 |
2 | 3 | # SoftVC VITS Singing Voice Conversion 本地部署教程 4 | [![Open in Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/SUC-DriverOld/so-vits-svc-Deployment-Documents/blob/4.1/sovits4_for_colab.ipynb)
5 | [English](README.md) | 简体中文
6 | 本帮助文档为项目 [so-vits-svc](https://github.com/svc-develop-team/so-vits-svc) 的详细安装、调试、推理教程,您也可以直接参阅官方 [README](https://github.com/svc-develop-team/so-vits-svc#readme) 文档
7 | 中文文档撰写:Sucial | [Bilibili 主页](https://space.bilibili.com/445022409) 8 | 9 |
10 | 11 | --- 12 | 13 | ✨ **点击前往:[配套视频教程](https://www.bilibili.com/video/BV1Hr4y197Cy/) | [UVR5 人声分离教程](https://www.bilibili.com/video/BV1F4421c7qU/)(注意:配套视频可能较老,仅供参考,一切以最新教程文档为准!)** 14 | 15 | ✨ **相关资料:[官方 README 文档](https://github.com/svc-develop-team/so-vits-svc) | [常见报错解决方法](https://www.yuque.com/umoubuton/ueupp5/ieinf8qmpzswpsvr) | [羽毛布団](https://space.bilibili.com/3493141443250876)** 16 | 17 | > [!IMPORTANT] 18 | > 19 | > 1. **重要!写在开头!** 不想手动配置环境/伸手党/寻找整合包的,请使用 [羽毛布団](https://space.bilibili.com/3493141443250876) 大佬的整合包! 20 | > 2. **关于旧版本教程**:如需 so-vits-svc3.0 版本的教程,请切换至 [3.0 分支](https://github.com/SUC-DriverOld/so-vits-svc-Chinese-Detaild-Documents/tree/3.0)。此分支教程已停止更新! 21 | > 3. **文档的持续完善**:若遇到本文档内未提到的报错,您可以在 issues 中提问;若遇到项目 bug,请给原项目提 issues;想要更加完善这份教程,欢迎来给提 pr! 22 | 23 | # 教程目录 24 | 25 | - [SoftVC VITS Singing Voice Conversion 本地部署教程](#softvc-vits-singing-voice-conversion-本地部署教程) 26 | - [教程目录](#教程目录) 27 | - [0. 用前须知](#0-用前须知) 28 | - [任何国家,地区,组织和个人使用此项目必须遵守以下法律](#任何国家地区组织和个人使用此项目必须遵守以下法律) 29 | - [《民法典》](#民法典) 30 | - [第一千零一十九条](#第一千零一十九条) 31 | - [第一千零二十四条](#第一千零二十四条) 32 | - [第一千零二十七条](#第一千零二十七条) 33 | - [《中华人民共和国宪法》|《中华人民共和国刑法》|《中华人民共和国民法典》|《中华人民共和国合同法》](#中华人民共和国宪法中华人民共和国刑法中华人民共和国民法典中华人民共和国合同法) 34 | - [0.1 使用规约](#01-使用规约) 35 | - [0.2 硬件需求](#02-硬件需求) 36 | - [0.3 提前准备](#03-提前准备) 37 | - [1. 环境依赖](#1-环境依赖) 38 | - [1.1 so-vits-svc4.1 源码](#11-so-vits-svc41-源码) 39 | - [1.2 Python](#12-python) 40 | - [1.3 Pytorch](#13-pytorch) 41 | - [1.4 其他依赖项安装](#14-其他依赖项安装) 42 | - [1.5 FFmpeg](#15-ffmpeg) 43 | - [2. 配置及训练](#2-配置及训练) 44 | - [2.1 关于兼容 4.0 模型的问题](#21-关于兼容-40-模型的问题) 45 | - [2.2 预先下载的模型文件](#22-预先下载的模型文件) 46 | - [2.2.1 必须项](#221-必须项) 47 | - [各编码器的详解](#各编码器的详解) 48 | - [2.2.2 预训练底模 (强烈建议使用)](#222-预训练底模-强烈建议使用) 49 | - [2.2.3 可选项 (根据情况选择)](#223-可选项-根据情况选择) 50 | - [2.3 数据集准备](#23-数据集准备) 51 | - [2.4 数据预处理](#24-数据预处理) 52 | - [2.4.0 音频切片](#240-音频切片) 53 | - [2.4.1 重采样至 44100Hz 单声道](#241-重采样至-44100hz-单声道) 54 | - [2.4.2 自动划分训练集、验证集,以及自动生成配置文件](#242-自动划分训练集验证集以及自动生成配置文件) 55 | - [使用响度嵌入](#使用响度嵌入) 56 | - [2.4.3 配置文件按需求修改](#243-配置文件按需求修改) 57 | - [config.json](#configjson) 58 | - [diffusion.yaml](#diffusionyaml) 59 | - [2.4.3 生成 hubert 与 f0](#243-生成-hubert-与-f0) 60 | - [各个 f0 预测器的优缺点](#各个-f0-预测器的优缺点) 61 | - [2.5 训练](#25-训练) 62 | - [2.5.1 主模型训练(必须)](#251-主模型训练必须) 63 | - [2.5.2 扩散模型(可选)](#252-扩散模型可选) 64 | - [2.5.3 Tensorboard](#253-tensorboard) 65 | - [3. 推理](#3-推理) 66 | - [3.1 命令行推理](#31-命令行推理) 67 | - [3.2 webUI 推理](#32-webui-推理) 68 | - [4. 增强效果的可选项](#4-增强效果的可选项) 69 | - [4.1 自动 f0 预测](#41-自动-f0-预测) 70 | - [4.2 聚类音色泄漏控制](#42-聚类音色泄漏控制) 71 | - [4.3 特征检索](#43-特征检索) 72 | - [4.4 声码器微调](#44-声码器微调) 73 | - [4.5 各模型保存的目录](#45-各模型保存的目录) 74 | - [5.其他可选功能](#5其他可选功能) 75 | - [5.1 模型压缩](#51-模型压缩) 76 | - [5.2 声线混合](#52-声线混合) 77 | - [5.2.1 静态声线混合](#521-静态声线混合) 78 | - [5.2.2 动态声线混合](#522-动态声线混合) 79 | - [5.3 Onnx 导出](#53-onnx-导出) 80 | - [6. 简单混音处理及成品导出](#6-简单混音处理及成品导出) 81 | - [附录:常见报错的解决办法](#附录常见报错的解决办法) 82 | - [关于爆显存](#关于爆显存) 83 | - [安装依赖时出现的相关报错](#安装依赖时出现的相关报错) 84 | - [数据集预处理和模型训练时的相关报错](#数据集预处理和模型训练时的相关报错) 85 | - [使用 WebUI 时相关报错](#使用-webui-时相关报错) 86 | - [感谢名单](#感谢名单) 87 | 88 | # 0. 
用前须知 89 | 90 | ### 任何国家,地区,组织和个人使用此项目必须遵守以下法律 91 | 92 | #### 《[民法典](http://gongbao.court.gov.cn/Details/51eb6750b8361f79be8f90d09bc202.html)》 93 | 94 | #### 第一千零一十九条 95 | 96 | 任何组织或者个人**不得**以丑化、污损,或者利用信息技术手段伪造等方式侵害他人的肖像权。**未经**肖像权人同意,**不得**制作、使用、公开肖像权人的肖像,但是法律另有规定的除外。**未经**肖像权人同意,肖像作品权利人不得以发表、复制、发行、出租、展览等方式使用或者公开肖像权人的肖像。对自然人声音的保护,参照适用肖像权保护的有关规定。 97 | **对自然人声音的保护,参照适用肖像权保护的有关规定** 98 | 99 | #### 第一千零二十四条 100 | 101 | 【名誉权】民事主体享有名誉权。任何组织或者个人**不得**以侮辱、诽谤等方式侵害他人的名誉权。 102 | 103 | #### 第一千零二十七条 104 | 105 | 【作品侵害名誉权】行为人发表的文学、艺术作品以真人真事或者特定人为描述对象,含有侮辱、诽谤内容,侵害他人名誉权的,受害人有权依法请求该行为人承担民事责任。行为人发表的文学、艺术作品不以特定人为描述对象,仅其中的情节与该特定人的情况相似的,不承担民事责任。 106 | 107 | #### 《[中华人民共和国宪法](http://www.gov.cn/guoqing/2018-03/22/content_5276318.htm)》|《[中华人民共和国刑法](http://gongbao.court.gov.cn/Details/f8e30d0689b23f57bfc782d21035c3.html?sw=中华人民共和国刑法)》|《[中华人民共和国民法典](http://gongbao.court.gov.cn/Details/51eb6750b8361f79be8f90d09bc202.html)》|《[中华人民共和国合同法](http://www.npc.gov.cn/zgrdw/npc/lfzt/rlyw/2016-07/01/content_1992739.htm)》 108 | 109 | ## 0.1 使用规约 110 | 111 | > [!WARNING] 112 | > 113 | > 1. **本教程仅供交流与学习使用,请勿用于违法违规或违反公序良德等不良用途。出于对音源提供者的尊重请勿用于鬼畜用途** 114 | > 2. **继续使用视为已同意本教程所述相关条例,本教程已进行劝导义务,不对后续可能存在问题负责** 115 | > 3. **请自行解决数据集授权问题,禁止使用非授权数据集进行训练!任何由于使用非授权数据集进行训练造成的问题,需自行承担全部责任和后果!与仓库、仓库维护者、svc develop team、教程发布者无关!** 116 | 117 | 具体使用规约如下: 118 | 119 | - 本教程内容仅代表个人,均不代表 so-vits-svc 团队及原作者观点 120 | - 本教程默认使用由 so-vits-svc 团队维护的仓库,涉及到的开源代码请自行遵守其开源协议 121 | - 任何发布到视频平台的基于 sovits 制作的视频,都必须要在简介明确指明用于变声器转换的输入源歌声、音频,例如:使用他人发布的视频或音频,通过分离的人声作为输入源进行转换的,必须要给出明确的原视频、音乐链接;若使用是自己的人声,或是使用其他歌声合成引擎合成的声音作为输入源进行转换的,也必须在简介加以说明。 122 | - 请确保你制作数据集的数据来源合法合规,且数据提供者明确你在制作什么以及可能造成的后果。由输入源造成的侵权问题需自行承担全部责任和一切后果。使用其他商用歌声合成软件作为输入源时,请确保遵守该软件的使用条例。注意,许多歌声合成引擎使用条例中明确指明不可用于输入源进行转换! 123 | - 云端训练和推理部分可能涉及资金使用,如果你是未成年人,请在获得监护人的许可与理解后进行,未经许可引起的后续问题,本教程概不负责 124 | - 本地训练(尤其是在硬件较差的情况下)可能需要设备长时间高负荷运行,请做好设备养护和散热措施 125 | - 出于设备原因,本教程仅在 Windows 系统下进行过测试,Mac 和 Linux 请确保自己有一定解决问题能力 126 | - 继续使用视为已同意本仓库 README 所述相关条例,本仓库 README 已进行劝导义务,不对后续可能存在问题负责。 127 | 128 | ## 0.2 硬件需求 129 | 130 | 1. 训练**必须使用** GPU 进行训练!推理目前分为**命令行推理**和**WebUI 推理**,对速度要求不高的话 CPU 和 GPU 均可使用。 131 | 2. 如需自己训练,请准备至少 **6G 以上专用显存的 NVIDIA 显卡**。 132 | 3. 请确保电脑的虚拟内存设置到**30G 以上**,并且最好设置在固态硬盘,不然会很慢。 133 | 4. **云端训练建议使用 [AutoDL](https://www.autodl.com/home) 平台**。若你会使用 [Google Colab](https://colab.google/),你也可以根据我们提供的 [sovits4_for_colab.ipynb](./sovits4_for_colab.ipynb) 进行配置。 134 | 135 | ## 0.3 提前准备 136 | 137 | 1. **至少**准备约 30 分钟(当然越多越好!)的**干净歌声**作为训练集,**无底噪,无混响**。并且歌唱时最好保持**音色尽量统一**,**音域覆盖广(训练集的音域决定模型训练出来的音域!)**,歌声**响度合适**,有条件的请做好**响度匹配**,可以使用 Au 等音频处理软件进行响度匹配。 138 | 2. **重要**!请提前下载训练需要用到的**底模**,参考 [2.2.2 预训练底模](#222-预训练底模-强烈建议使用)。 139 | 3. 推理:需准备**底噪<30dB**,尽量**不要带混响和和声**的**干音**进行推理。 140 | 141 | > [!NOTE] 142 | > 143 | > **须知 1**:歌声和语音都可以作为训练集,但语音作为训练集可能使**高音和低音推理出现问题(俗称音域问题/哑音)**,因为一定程度上训练集的音域决定了模型训练出来的音域!。所以如果你最终想实现唱歌,建议使用歌声作为训练集。 144 | > 145 | > **须知 2**:使用男声模型推理女生演唱的歌曲时,若出现明显哑音,请降调推理(一般是降 12 个半音即一个八度);同理,使用女声模型推理男生演唱的歌曲时,也可以尝试升调推理。 146 | > 147 | > **✨ 2024.3.8 最新建议 ✨**:目前 [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS) 项目的 TTS 与 so-vits-svc 的文字转语音相比,训练集需求量更小,训练速度更快,效果更好,所以此处建议**若想使用语音生成功能**,请移步 [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)。也因此,建议大家使用歌声素材作为训练集来训练本项目。 148 | 149 | # 1. 
环境依赖 150 | 151 | ✨ **本项目需要的环境**:[NVIDIA-CUDA](https://developer.nvidia.com/cuda-toolkit) | [Python](https://www.python.org/) = 3.8.9(项目建议此版本) | [Pytorch](https://pytorch.org/get-started/locally/) | [FFmpeg](https://ffmpeg.org/) 152 | 153 | ✨ **我写的一个 so-vits-svc 一键配置环境,启动 webUI 的脚本:[so-vits-svc-webUI-QuickStart-bat](https://github.com/SUC-DriverOld/so-vits-svc-webUI-QuickStart-bat) 也可以尝试使用!** 154 | 155 | ## 1.1 so-vits-svc4.1 源码 156 | 157 | 你可以任选以下两种方法之一下载或拉取源码: 158 | 159 | 1. 从 Github 项目网页端下载源码压缩包。前往 [so-vits-svc 官方源码仓库](https://github.com/svc-develop-team/so-vits-svc),点击右上角绿色的 `Code` 按钮,选择 `Download ZIP` 下载压缩包。若需要其他分支的代码请先更改分支。下载完成后,解压压缩包到任意目录,该目录将作为你的工作目录。 160 | 161 | 2. 使用 git 拉取源码。通过以下命令: 162 | 163 | ```bash 164 | git clone https://github.com/svc-develop-team/so-vits-svc.git 165 | ``` 166 | 167 | ## 1.2 Python 168 | 169 | - 前往 [Python 官网](https://www.python.org/) 下载 Python 3.8 安装并**添加系统环境变量**。详细安装方法以及添加 Path 此处省略,网上随便一查都有,不再赘述。 170 | 171 | ```bash 172 | # conda配置方法, 将YOUR_ENV_NAME替换成你想要创建的虚拟环境名字。 173 | conda create -n YOUR_ENV_NAME python=3.8 -y 174 | conda activate YOUR_ENV_NAME 175 | # 此后每次执行命令前,请确保处于该虚拟环境下! 176 | ``` 177 | 178 | - 安装完成后在 cmd 控制台中输入 `python`,出现类似以下内容则安装成功: 179 | 180 | ```bash 181 | Python 3.8.9 (tags/v3.8.9:9d38120, Mar 23 2022, 23:13:41) [MSC v.1929 64 bit (AMD64)] on win32 182 | Type "help", "copyright", "credits" or "license" for more information. 183 | >>> 184 | ``` 185 | 186 | **关于 Python 版本问题**:在进行测试后,我们认为 Python 3.8.9 能够稳定地运行该项目(但不排除高版本也可以运行)。 187 | 188 | - 配置 python 下载镜像源(有国外网络条件可跳过)在控制台执行 189 | 190 | ```bash 191 | # 设置清华大学下载镜像 192 | pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple 193 | pip config set global.trusted-host pypi.tuna.tsinghua.edu.cn 194 | ``` 195 | 196 | - 如果想要还原为默认源,类似的,你仅需要在控制台执行 197 | 198 | ```bash 199 | pip config set global.index-url https://pypi.python.org/simple 200 | ``` 201 | 202 | **python 国内常用镜像源**: 203 | 204 | - 清华:https://pypi.tuna.tsinghua.edu.cn/simple 205 | - 阿里云:http://mirrors.aliyun.com/pypi/simple 206 | 207 | ```bash 208 | # 临时更换 209 | pip install PACKAGE_NAME -i https://pypi.tuna.tsinghua.edu.cn/simple 210 | # 永久更换 211 | pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple 212 | ``` 213 | 214 | ## 1.3 Pytorch 215 | 216 | > [!IMPORTANT] 217 | > 218 | > ✨ 此处建议安装 CUDA 11.7 或 11.8 对应的 Pytorch 版本,CUDA 12.0 及以上版本可能不适用。 219 | 220 | - 我们需要 **单独安装** `torch`, `torchaudio`, `torchvision` 这三个库,直接前往 [Pytorch 官网](https://pytorch.org/get-started/locally/) 选择所需版本然后复制 Run this Command 栏显示的命令至控制台安装即可。旧版本的 Pytorch 可以从 [此处](https://pytorch.org/get-started/previous-versions/) 下载到。 221 | - 安装完 `torch`, `torchaudio`, `torchvision` 这三个库之后,在 cmd 控制台中运行以下命令检测 torch 能否成功调用 CUDA。最后一行出现 `True` 则成功,出现`False` 则失败,需要重新安装正确的版本。 222 | 223 | ```bash 224 | python 225 | # 回车运行 226 | import torch 227 | # 回车运行 228 | print(torch.cuda.is_available()) 229 | # 回车运行 230 | ``` 231 | 232 | > [!NOTE] 233 | > 234 | > 1. 如需手动指定 `torch` 的版本在其后面添加版本号即可,例如 `pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu117` 235 | > 2. 安装 CUDA=11.7 的 Pytorch 时,可能会遇到报错 `ERROR: Package 'networkx' requires a different Python: 3.8.9 not in '>=3.9`。此时,请先执行 `pip install networkx==3.0`,之后再进行 Pytorch 的安装就不会出现类似报错了。 236 | > 3.
由于版本更新,11.7 的 Pytorch 可能复制不到下载链接,你可以直接复制下方的安装命令安装 Pytorch11.7。或前往 [此处](https://pytorch.org/get-started/previous-versions/) 下载旧版本。 237 | 238 | ```bash 239 | pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu117 240 | ``` 241 | 242 | ## 1.4 其他依赖项安装 243 | 244 | > [!IMPORTANT] 245 | > ✨ 在开始其他依赖项安装之前,**请务必下载并安装** [Visual Studio 2022](https://visualstudio.microsoft.com/) 或者 [Microsoft C++ 生成工具](https://visualstudio.microsoft.com/zh-hans/visual-cpp-build-tools/)(体积较前者更小)。**勾选并安装组件包:“使用 C++的桌面开发”**,执行修改并等待其安装完成。 246 | 247 | - 在 [1.1](#11-so-vits-svc41-源码) 解压得到的项目文件夹内右击空白处选择 **在终端中打开** 。使用下面的命令先更新一下 `pip`, `wheel`, `setuptools` 这三个包。 248 | 249 | ```bash 250 | pip install --upgrade pip==23.3.2 wheel setuptools 251 | ``` 252 | 253 | - 执行下面命令以安装库(**若出现报错请多次尝试直到没有报错,依赖全部安装完成**)。注意,项目文件夹内含有三个 requirements 的 txt ,此处选择 `requirements_win.txt`) 254 | 255 | ```bash 256 | pip install -r requirements_win.txt 257 | ``` 258 | 259 | - 确保安装 **正确无误** 后请使用下方命令更新 `fastapi`, `gradio`, `pydantic` 这三个依赖: 260 | 261 | ```bash 262 | pip install --upgrade fastapi==0.84.0 263 | pip install --upgrade pydantic==1.10.12 264 | pip install --upgrade gradio==3.41.2 265 | ``` 266 | 267 | ## 1.5 FFmpeg 268 | 269 | - 前往 [FFmpeg 官网](https://ffmpeg.org/) 下载。解压至任意位置并在环境变量中添加 Path 。定位至 `.\ffmpeg\bin`(详细安装方法以及添加 Path 此处省略,网上随便一查都有,不再赘述) 270 | - 安装完成后在 cmd 控制台中输入 `ffmpeg -version` 出现类似以下内容则安装成功: 271 | 272 | ```bash 273 | ffmpeg version git-2020-08-12-bb59bdb Copyright (c) 2000-2020 the FFmpeg developers 274 | built with gcc 10.2.1 (GCC) 20200805 275 | configuration: [此处省略一大堆内容] 276 | libavutil 56. 58.100 / 56. 58.100 277 | libavcodec 58.100.100 / 58.100.100 278 | ...[此处省略一大堆内容] 279 | ``` 280 | 281 | # 2. 配置及训练 282 | 283 | ✨ 此部分内容是整个教程文档中最重要的部分,本教程参考了 [官方文档](https://github.com/svc-develop-team/so-vits-svc#readme),并适当添加了一些解释和说明,便于理解。 284 | 285 | ✨ 在开始第二部分内容前,请确保电脑的虚拟内存设置到 **30G 以上**,并且最好设置在固态硬盘。具体设置方法请自行上网搜索。 286 | 287 | ## 2.1 关于兼容 4.0 模型的问题 288 | 289 | - 你可以通过修改 4.0 模型的 config.json 对 4.0 的模型进行支持。需要在 config.json 的 model 字段中添加 speech_encoder 字段,具体如下: 290 | 291 | ```bash 292 | "model": 293 | { 294 | # 省略其他内容 295 | 296 | # "ssl_dim",填256或者768,需要和下面"speech_encoder"匹配 297 | "ssl_dim": 256, 298 | # 说话人个数 299 | "n_speakers": 200, 300 | # 或者"vec768l12",但请注意此项的值要和上面的"ssl_dim"相互匹配。即256对应vec256l9,768对应vec768l12。 301 | "speech_encoder":"vec256l9" 302 | # 如果不知道自己的模型是vec768l12还是vec256l9,可以看"gin_channels"字段的值来确认。 303 | 304 | # 省略其他内容 305 | } 306 | ``` 307 | 308 | ## 2.2 预先下载的模型文件 309 | 310 | ### 2.2.1 必须项 311 | 312 | > [!WARNING] 313 | > 314 | > **以下编码器必须需选择一个使用:** 315 | > 316 | > - "vec768l12" 317 | > - "vec256l9" 318 | > - "vec256l9-onnx" 319 | > - "vec256l12-onnx" 320 | > - "vec768l9-onnx" 321 | > - "vec768l12-onnx" 322 | > - "hubertsoft-onnx" 323 | > - "hubertsoft" 324 | > - "whisper-ppg" 325 | > - "cnhubertlarge" 326 | > - "dphubert" 327 | > - "whisper-ppg-large" 328 | > - "wavlmbase+" 329 | 330 | **中国用户可以在[这里](https://pan.baidu.com/s/17IlNuphFHAntLklkMNtagg?pwd=dkp9)下载百度网盘转存的模型** 331 | 332 | | 编码器 | 下载地址 | 放置位置 | 说明 | 333 | | --------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------- | ----------------------------------------------------------------- | 334 | | contentvec(推荐) | 
[checkpoint_best_legacy_500.pt](https://ibm.box.com/s/z1wgl1stco8ffooyatzdwsqn2psd9lrr) | 放在`pretrain`目录下 | `vec768l12`与`vec256l9` 需要该编码器 | 335 | | | [hubert_base.pt](https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/hubert_base.pt) | 将文件名改为 checkpoint_best_legacy_500.pt 后,放在`pretrain`目录下 | 与上面的`checkpoint_best_legacy_500.pt`效果相同,但大小只有 199MB | 336 | | hubertsoft | [hubert-soft-0d54a1f4.pt](https://github.com/bshall/hubert/releases/download/v0.1/hubert-soft-0d54a1f4.pt) | 放在`pretrain`目录下 | so-vits-svc3.0 使用的是该编码器 | 337 | | Whisper-ppg | [medium.pt](https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt) | 放在`pretrain`目录下 | 该模型适配`whisper-ppg` | 338 | | whisper-ppg-large | [large-v2.pt](https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt) | 放在`pretrain`目录下 | 该模型适配`whisper-ppg-large` | 339 | | cnhubertlarge | [chinese-hubert-large-fairseq-ckpt.pt](https://huggingface.co/TencentGameMate/chinese-hubert-large/resolve/main/chinese-hubert-large-fairseq-ckpt.pt) | 放在`pretrain`目录下 | - | 340 | | dphubert | [DPHuBERT-sp0.75.pth](https://huggingface.co/pyf98/DPHuBERT/resolve/main/DPHuBERT-sp0.75.pth) | 放在`pretrain`目录下 | - | 341 | | WavLM | [WavLM-Base+.pt](https://valle.blob.core.windows.net/share/wavlm/WavLM-Base+.pt?sv=2020-08-04&st=2023-03-01T07%3A51%3A05Z&se=2033-03-02T07%3A51%3A00Z&sr=c&sp=rl&sig=QJXmSJG9DbMKf48UDIU1MfzIro8HQOf3sqlNXiflY1I%3D) | 放在`pretrain`目录下 | 下载链接可能存在问题,无法下载 | 342 | | OnnxHubert/ContentVec | [MoeSS-SUBModel](https://huggingface.co/NaruseMioShirakana/MoeSS-SUBModel/tree/main) | 放在`pretrain`目录下 | - | 343 | 344 | #### 各编码器的详解 345 | 346 | | 编码器名称 | 优点 | 缺点 | 347 | | --------------------- | ------------------------------------ | -------------------- | 348 | | `vec768l12`(最推荐) | 最还原音色、有大型底模、支持响度嵌入 | 咬字能力较弱 | 349 | | `vec256l9` | 无特别的优点 | 不支持扩散模型 | 350 | | `hubertsoft` | 咬字能力较强 | 音色泄露 | 351 | | `whisper-ppg` | 咬字最强 | 音色泄露、显存占用高 | 352 | 353 | ### 2.2.2 预训练底模 (强烈建议使用) 354 | 355 | - 预训练底模文件: `G_0.pth` `D_0.pth`。放在`logs/44k`目录下。 356 | 357 | - 扩散模型预训练底模文件: `model_0.pt`。放在`logs/44k/diffusion`目录下。 358 | 359 | 扩散模型引用了[DDSP-SVC](https://github.com/yxlllc/DDSP-SVC)的 Diffusion Model,底模与[DDSP-SVC](https://github.com/yxlllc/DDSP-SVC)的扩散模型底模通用。以下提供的底模文件部分来自“[羽毛布団](https://space.bilibili.com/3493141443250876)”的整合包,在此表示感谢。 360 | 361 | **提供 4.1 训练底模,请自行下载(需具备外网条件)** 362 | 363 | **中国用户可以在[这里](https://pan.baidu.com/s/17IlNuphFHAntLklkMNtagg?pwd=dkp9)下载百度网盘转存的模型** 364 | 365 | | 编码器类型 | 主模型底模 | 扩散模型底模 | 说明 | 366 | | ------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | 367 | | vec768l12 | [G_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/vec768l12/G_0.pth), 
[D_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/vec768l12/D_0.pth) | [model_0.pt](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/diffusion/768l12/model_0.pt) | 若仅训练 100 步扩散,即`k_step_max = 100`,扩散模型请使用[model_0.pt](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/diffusion/768l12/max100/model_0.pt) | 368 | | vec768l12(开启响度嵌入) | [G_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/vec768l12/vol_emb/G_0.pth), [D_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/vec768l12/vol_emb/D_0.pth) | [model_0.pt](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/diffusion/768l12/model_0.pt) | 若仅训练 100 步扩散,即`k_step_max = 100`,扩散模型请使用[model_0.pt](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/diffusion/768l12/max100/model_0.pt) | 369 | | vec256l9 | [G_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/vec256l9/G_0.pth), [D_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/vec256l9/D_0.pth) | 不支持扩散 | - | 370 | | hubertsoft | [G_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/hubertsoft/G_0.pth), [G_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/hubertsoft/G_0.pth) | [model_0.pt](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/diffusion/hubertsoft/model_0.pt) | - | 371 | | whisper-ppg | [G_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/whisper-ppg/G_0.pth), [D_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/whisper-ppg/D_0.pth) | [model_0.pt](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/diffusion/whisper-ppg/model_0.pt) | - | 372 | | tiny(vec768l12_vol_emb) | [G_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/tiny/vec768l12_vol_emb/G_0.pth), [D_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/tiny/vec768l12_vol_emb/D_0.pth) | - | TINY 是在原版 So-VITS 模型的基础上减少了网络参数,并采用了深度可分离卷积(Depthwise Separable Convolution)和 FLOW 共享参数技术,使得模型的大小明显减小,并且提升了推理速度。TINY 模型是专门设计以用来实时转换的,模型的参数减少意味着其转换效果理论上不如原版模型。目前适用于 So-VITS 的实时转换 GUI 还在开发中,在此之前,如果没有特殊需求,不建议训练 TINY 模型。 | 373 | 374 | > [!WARNING] 375 | > 376 | > 其他未提及的编码器不提供预训练模型,请无底模训练,训练难度可能会大幅提升! 377 | 378 | **底模及支持性** 379 | 380 | | 标准底模 | 响度嵌入 | 响度嵌入+TINY | 完整扩散 | 100 步浅扩散 | 381 | | ----------- | -------- | ------------- | -------- | ------------ | 382 | | Vec768L12 | 支持 | 支持 | 支持 | 支持 | 383 | | Vec256L9 | 支持 | 不支持 | 不支持 | 不支持 | 384 | | hubertsoft | 支持 | 不支持 | 支持 | 不支持 | 385 | | whisper-ppg | 支持 | 不支持 | 支持 | 不支持 | 386 | 387 | ### 2.2.3 可选项 (根据情况选择) 388 | 389 | **1. NSF-HIFIGAN** 390 | 391 | 如果使用`NSF-HIFIGAN增强器`或`浅层扩散`的话,需要下载预训练的 NSF-HIFIGAN 模型(由[OpenVPI](https://github.com/openvpi)提供),如果不需要可以不下载。 392 | 393 | - 预训练的 NSF-HIFIGAN 声码器 : 394 | - 2022.12 版:[nsf_hifigan_20221211.zip](https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip); 395 | - 2024.02 版:[nsf_hifigan_44.1k_hop512_128bin_2024.02.zip](https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-44.1k-hop512-128bin-2024.02/nsf_hifigan_44.1k_hop512_128bin_2024.02.zip) 396 | - 解压后,将四个文件放在`pretrain/nsf_hifigan`目录下。 397 | - 若下载 2024.02 版的声码器,需要将`model.ckpt`重命名为`model`,即去掉后缀。 398 | 399 | **2. 
RMVPE** 400 | 401 | 如果使用`rmvpe`F0 预测器的话,需要下载预训练的 RMVPE 模型。 402 | 403 | - 下载模型[rmvpe.zip](https://github.com/yxlllc/RMVPE/releases/download/230917/rmvpe.zip),目前首推该权重。 404 | - 解压缩`rmvpe.zip`,并将其中的`model.pt`文件改名为`rmvpe.pt`并放在`pretrain`目录下。 405 | 406 | **3. FCPE(预览版)** 407 | 408 | [FCPE](https://github.com/CNChTu/MelPE) (Fast Context-base Pitch Estimator) 是由 svc-develop-team 自主研发的一款全新的 F0 预测器,是一个为实时语音转换所设计的专用 F0 预测器,他将在未来成为 Sovits 实时语音转换的首选 F0 预测器。 409 | 410 | 如果使用 `fcpe` F0 预测器的话,需要下载预训练的 FCPE 模型。 411 | 412 | - 下载模型 [fcpe.pt](https://huggingface.co/datasets/ylzz1997/rmvpe_pretrain_model/resolve/main/fcpe.pt) 413 | - 放在`pretrain`目录下。 414 | 415 | ## 2.3 数据集准备 416 | 417 | 1. 按照以下文件结构将数据集放入 `dataset_raw` 目录。 418 | 419 | ``` 420 | dataset_raw 421 | ├───speaker0 422 | │ ├───xxx1-xxx1.wav 423 | │ ├───... 424 | │ └───Lxx-0xx8.wav 425 | └───speaker1 426 | ├───xx2-0xxx2.wav 427 | ├───... 428 | └───xxx7-xxx007.wav 429 | ``` 430 | 431 | 2. 可以自定义说话人名称。 432 | 433 | ``` 434 | dataset_raw 435 | └───suijiSUI 436 | ├───1.wav 437 | ├───... 438 | └───25788785-20221210-200143-856_01_(Vocals)_0_0.wav 439 | ``` 440 | 441 | 3. 此外还需要在`dataset_raw`新建并编辑`config.json` 442 | 443 | ```json 444 | "n_speakers": 10 445 | 446 | "spk":{ 447 | "speaker0": 0, 448 | "speaker1": 1 449 | } 450 | ``` 451 | 452 | - `"n_speakers": 10`: 数字代表说话人人数,从 1 开始计数,需要和下面的人数对应 453 | - `"speaker0": 0`: speaker0 指的是说话人名字,可以更改, 后面的数字 0,1,2...代表说话人计数,从 0 开始计数 454 | 455 | ## 2.4 数据预处理 456 | 457 | ### 2.4.0 音频切片 458 | 459 | - 将音频切片至`5s - 15s`, 稍微长点也无伤大雅,实在太长可能会导致训练中途甚至预处理就爆显存。 460 | 461 | - 可以使用 [audio-slicer-GUI](https://github.com/flutydeer/audio-slicer) | [audio-slicer-CLI](https://github.com/openvpi/audio-slicer) 进行辅助切片。一般情况下只需调整其中的 `Minimum Interval`,普通说话素材通常保持默认即可,歌唱素材可以调整至 `100` 甚至 `50`。 462 | 463 | - 切完之后请手动处理过长(大于 15 秒)或过短(小于 4 秒)的音频,过短的音频可以多段拼接,过长的音频可以手动切分。 464 | 465 | > [!WARNING] 466 | > 467 | > **如果你使用 Whisper-ppg 声音编码器进行训练,所有的切片长度必须小于 30s** 468 | 469 | ### 2.4.1 重采样至 44100Hz 单声道 470 | 471 | 使用下面的命令(若已经经过响度匹配,请跳过该行看下面的 NOTE): 472 | 473 | ```bash 474 | python resample.py 475 | ``` 476 | 477 | > [!NOTE] 478 | > 479 | > 虽然本项目拥有重采样、转换单声道与响度匹配的脚本 `resample.py`,但是默认的响度匹配是匹配到 0db。这可能会造成音质的受损。而 `python` 的响度匹配包 `pyloudnorm` 无法对电平进行压限,这会导致爆音。所以建议可以考虑使用专业声音处理软件如 `Adobe Audition` 等软件做响度匹配处理。此处也可以使用我写的一个响度匹配工具 [Loudness Matching Tool](https://github.com/AI-Hobbyist/Loudness-Matching-Tool) 进行处理。若已经使用其他软件做响度匹配,可以在运行上述命令时添加 `--skip_loudnorm` 跳过响度匹配步骤。如: 480 | 481 | ```bash 482 | python resample.py --skip_loudnorm 483 | ``` 484 | 485 | ### 2.4.2 自动划分训练集、验证集,以及自动生成配置文件 486 | 487 | 使用下面的命令(若需要响度嵌入,请跳过该行看下面的使用响度嵌入): 488 | 489 | ```bash 490 | python preprocess_flist_config.py --speech_encoder vec768l12 491 | ``` 492 | 493 | speech_encoder 拥有以下七个选择,具体讲解请看 **[2.2.1 必须项及各编码器的详解](#各编码器的详解)**。如果省略 speech_encoder 参数,默认值为 vec768l12。 494 | 495 | ``` 496 | vec768l12 497 | vec256l9 498 | hubertsoft 499 | whisper-ppg 500 | whisper-ppg-large 501 | cnhubertlarge 502 | dphubert 503 | ``` 504 | 505 | #### 使用响度嵌入 506 | 507 | - 使用响度嵌入后训练出的模型将匹配到输入源响度,否则为训练集响度。 508 | - 若使用响度嵌入,需要增加`--vol_aug`参数,比如: 509 | 510 | ```bash 511 | python preprocess_flist_config.py --speech_encoder vec768l12 --vol_aug 512 | ``` 513 | 514 | ### 2.4.3 配置文件按需求修改 515 | 516 | #### config.json 517 | 518 | - `vocoder_name`: 选择一种声码器,默认为`nsf-hifigan` 519 | - `log_interval`:多少步输出一次日志,默认为 `200` 520 | - `eval_interval`:多少步进行一次验证并保存一次模型,默认为 `800` 521 | - `epochs`:训练总轮数,默认为 `10000`,达到此轮数后将自动停止训练 522 | - `learning_rate`:学习率,建议保持默认值不要改 523 | - `batch_size`:单次训练加载到 GPU 的数据量,调整到低于显存容量的大小即可 524 | - 
`all_in_mem`:加载所有数据集到内存中,某些平台的硬盘 IO 过于低下、同时内存容量 **远大于** 数据集体积时可以启用 525 | - `keep_ckpts`:训练时保留最后几个模型,`0`为保留所有,默认只保留最后`3`个 526 | 527 | **声码器列表** 528 | 529 | ``` 530 | nsf-hifigan 531 | nsf-snake-hifigan 532 | ``` 533 | 534 | #### diffusion.yaml 535 | 536 | - `cache_all_data`:加载所有数据集到内存中,某些平台的硬盘 IO 过于低下、同时内存容量 **远大于** 数据集体积时可以启用 537 | - `duration`:训练时音频切片时长,可根据显存大小调整,**注意,该值必须小于训练集内音频的最短时间!** 538 | - `batch_size`:单次训练加载到 GPU 的数据量,调整到低于显存容量的大小即可 539 | - `timesteps` : 扩散模型总步数,默认为 1000。完整的高斯扩散一共 1000 步 540 | - `k_step_max` : 训练时可仅训练 `k_step_max` 步扩散以节约训练时间,注意,该值必须小于`timesteps`,0 为训练整个扩散模型,**注意,如果不训练整个扩散模型将无法使用仅扩散模型推理!** 541 | 542 | ### 2.4.3 生成 hubert 与 f0 543 | 544 | 使用下面的命令(若需要训练浅扩散,请跳过该行看下面的浅扩散): 545 | 546 | ```bash 547 | # 下面的命令使用了rmvpe作为f0预测器,你可以手动进行修改 548 | python preprocess_hubert_f0.py --f0_predictor rmvpe 549 | ``` 550 | 551 | f0_predictor 拥有六个选择,部分 f0 预测器需要额外下载预处理模型,具体请参考 **[2.2.3 可选项 (根据情况选择)](#223-可选项-根据情况选择)** 552 | 553 | ``` 554 | crepe 555 | dio 556 | pm 557 | harvest 558 | rmvpe(推荐!) 559 | fcpe 560 | ``` 561 | 562 | #### 各个 f0 预测器的优缺点 563 | 564 | | 预测器 | 优点 | 缺点 | 565 | | ------- | --------------------------------------------------------- | -------------------------------------------- | 566 | | pm | 速度快,占用低 | 容易出现哑音 | 567 | | crepe | 基本不会出现哑音 | 显存占用高,自带均值滤波,因此可能会出现跑调 | 568 | | dio | - | 可能跑调 | 569 | | harvest | 低音部分有更好表现 | 其他音域就不如别的算法了 | 570 | | rmvpe | 六边形战士,目前最完美的预测器 | 几乎没有缺点(极端长低音可能会出错) | 571 | | fcpe | SVC 开发组自研,目前最快的预测器,且有不输 crepe 的准确度 | - | 572 | 573 | > [!NOTE] 574 | > 575 | > 1. 如果训练集过于嘈杂,请使用 crepe 处理 f0 576 | > 2. 如果省略 f0_predictor 参数,默认值为 rmvpe 577 | 578 | **若需要浅扩散功能(可选),需要增加 `--use_diff` 参数,比如:** 579 | 580 | ```bash 581 | # 下面的命令使用了rmvpe作为f0预测器,你可以手动进行修改 582 | python preprocess_hubert_f0.py --f0_predictor rmvpe --use_diff 583 | ``` 584 | 585 | **如果处理速度较为缓慢,或者你的数据集较大,可以加上`--num_processes`参数:** 586 | 587 | ```bash 588 | # 下面的命令使用了rmvpe作为f0预测器,你可以手动进行修改 589 | python preprocess_hubert_f0.py --f0_predictor rmvpe --num_processes 8 590 | # 所有的Workers会被自动分配到多个线程上 591 | ``` 592 | 593 | 执行完以上步骤后生成的 `dataset` 目录便是预处理完成的数据,此时你可以按需删除 `dataset_raw` 文件夹了。 594 | 595 | ## 2.5 训练 596 | 597 | ### 2.5.1 主模型训练(必须) 598 | 599 | 使用下面的命令训练主模型,训练暂停后也可使用下面的命令继续训练。 600 | 601 | ```bash 602 | python train.py -c configs/config.json -m 44k 603 | ``` 604 | 605 | ### 2.5.2 扩散模型(可选) 606 | 607 | - So-VITS-SVC 4.1 的一个重大更新就是引入了浅扩散 (Shallow Diffusion) 机制,将 SoVITS 的原始输出音频转换为 Mel 谱图,加入噪声并进行浅扩散处理后经过声码器输出音频。经过测试,**原始输出音频在经过浅扩散处理后可以显著改善电音、底噪等问题,输出质量得到大幅增强**。 608 | - 尚若需要浅扩散功能,需要训练扩散模型,训练前请确保你已经下载并正确放置好了 `NSF-HIFIGAN` (**参考 [2.2.3 可选项](#223-可选项-根据情况选择)**),并且预处理生成 hubert 与 f0 时添加了 `--use_diff` 参数(**参考 [2.4.3 生成 hubert 与 f0](#243-生成-hubert-与-f0)**) 609 | 610 | 扩散模型训练方法为: 611 | 612 | ```bash 613 | python train_diff.py -c configs/diffusion.yaml 614 | ``` 615 | 616 | 模型训练结束后,模型文件保存在`logs/44k`目录下,扩散模型在`logs/44k/diffusion`下。 617 | 618 | > [!IMPORTANT] 619 | > 620 | > **模型怎样才算训练好了**? 621 | > 622 | > 1. 这是一个非常无聊且没有意义的问题。就好比上来就问老师我家孩子怎么才能学习好,除了你自己,没有人能回答这个问题。 623 | > 2. 模型的训练关联于你的数据集质量、时长,所选的编码器、f0 算法,甚至一些超自然的玄学因素,即便你有一个成品模型,最终的转换效果也要取决于你的输入源以及推理参数。这不是一个线性的的过程,之间的变量实在是太多,所以你非得问“为什么我的模型出来不像啊”、“模型怎样才算训练好了”这样的问题,我只能说 WHO F\*\*KING KNOWS? 624 | > 3. 但也不是一点办法没有,只能烧香拜佛了。我不否认烧香拜佛当然是一个有效的手段,但你也可以借助一些科学的工具,例如 Tensorboard 等,下方 [2.5.3](#253-tensorboard) 就将教你怎么通过看 Tensorboard 来辅助了解训练状态,**当然,最强的辅助工具其实长在你自己身上,一个声学模型怎样才算训练好了? 
塞上耳机,让你的耳朵告诉你吧**。 625 | 626 | **Epoch 和 Step 的关系**: 627 | 628 | 训练过程中会根据你在 `config.json` 中设置的保存步数(默认为 800 步,与 `eval_interval` 的值对应)保存一次模型。 629 | 请严格区分轮数 (Epoch) 和步数 (Step):1 个 Epoch 代表训练集中的所有样本都参与了一次学习,1 Step 代表进行了一步学习,由于 batch_size 的存在,每步学习可以含有数条样本,因此,Epoch 和 Step 的换算如下: 630 | 631 | $$ 632 | Epoch = \frac{Step}{(\text{数据集条数}{\div}batch\_size)} 633 | $$ 634 | 例如:若训练集共有 600 条切片、`batch_size` 为 6,则 1 个 Epoch 即为 100 Step;按默认每 800 Step 保存一次计算,约每 8 个 Epoch 保存一次模型。 635 | 训练默认 10000 轮后结束(可以通过修改`config.json`中的`epochs`字段的数值来增加或减小上限),但正常训练通常只需要数百轮即可有较好的效果。当你觉得训练差不多完成了,可以在训练终端按 `Ctrl + C` 中断训练。中断后只要没有重新预处理训练集,就可以 **从最近一次保存点继续训练**。 636 | 637 | ### 2.5.3 Tensorboard 638 | 639 | 你可以用 Tensorboard 来查看训练过程中的损失函数值 (loss) 趋势,试听音频,从而辅助判断模型训练状态。**但是,就 So-VITS-SVC 这个项目而言,损失函数值(loss)并没有什么实际参考意义(你不用刻意对比研究这个值的高低),真正有参考意义的还是推理后靠你的耳朵来听!** 640 | 641 | - 使用下面的命令打开 Tensorboard: 642 | 643 | ```bash 644 | tensorboard --logdir=./logs/44k 645 | ``` 646 | 647 | Tensorboard 是根据训练时默认每 200 步的评估生成日志的,如果训练未满 200 步,则 Tensorboard 中不会出现任何图像。200 这个数值可以通过修改 `config.json` 中的 `log_interval` 值来修改。 648 | 649 | - Losses 详解 650 | 651 | 你不需要理解每一个 loss 的具体含义,大致来说: 652 | 653 | - loss/g/f0、loss/g/mel 和 loss/g/total 应当是震荡下降的,并最终收敛在某个值 654 | - loss/g/kl 应当是低位震荡的 655 | - loss/g/fm 应当在训练的中期持续上升,并在后期放缓上升趋势甚至开始下降 656 | 657 | > [!IMPORTANT] 658 | > 659 | > ✨ 观察 losses 曲线的趋势可以帮助你判断模型的训练状态。但 losses 并不能作为判断模型训练状态的唯一参考,**甚至它的参考价值其实并不大,你仍需要通过自己的耳朵来判断模型是否训练好了**。 660 | 661 | > [!WARNING] 662 | > 663 | > 1. 对于小数据集(30 分钟甚至更小),在加载底模的情况下,不建议训练过久,这样是为了尽可能利用底模的优势。数千步甚至数百步就能有最好的结果。 664 | > 2. Tensorboard 中的试听音频是根据你的验证集生成的,**无法代表模型最终的表现**。 665 | 666 | # 3. 推理 667 | 668 | ✨ 推理时请先准备好需要推理的干声,确保干声无底噪/无混响/质量较好。你可以使用 [UVR5](https://github.com/Anjok07/ultimatevocalremovergui/releases/tag/v5.6) 进行处理,得到干声。此外,我也制作了一个 [UVR5 人声分离教程](https://www.bilibili.com/video/BV1F4421c7qU/)。 669 | 670 | ## 3.1 命令行推理 671 | 672 | 使用 inference_main.py 进行推理 673 | 674 | ```bash 675 | # 例 676 | python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "你的推理音频.wav" -t 0 -s "speaker" 677 | ``` 678 | 679 | **必填项部分:** 680 | 681 | - `-m` | `--model_path`:模型路径 682 | - `-c` | `--config_path`:配置文件路径 683 | - `-n` | `--clean_names`:wav 文件名列表,放在 raw 文件夹下 684 | - `-t` | `--trans`:音高调整,支持正负(半音) 685 | - `-s` | `--spk_list`:合成目标说话人名称 686 | - `-cl` | `--clip`:音频强制切片,默认 0 为自动切片,单位为秒/s。 687 | 688 | > [!NOTE] 689 | > 690 | > **音频切片** 691 | > 692 | > - 推理时,切片工具会将上传的音频根据静音段切分为数个小段,分别推理后合并为完整音频。这样做的好处是**小段音频推理显存占用低,因而可以将长音频切分推理以免爆显存**。切片阈值参数控制的是最小满刻度分贝值,低于这个值将被切片工具视为静音并去除。因此,当上传的音频比较嘈杂时,可以将该参数设置得高一些(如 -30),反之,可以将该值设置得小一些(如 -50)避免切除呼吸声和细小人声。 693 | > 694 | > - 开发团队近期的一项测试表明,较小的切片阈值(如-50)会改善输出的咬字,至于原理暂不清楚。 695 | > 696 | > **强制切片** `-cl` | `--clip` 697 | > 698 | > - 推理时,切片工具会将上传的音频根据静音段切分为数个小段,分别推理后合并为完整音频。但有时当人声过于连续,长时间不存在静音段时,切片工具也会相应切出来过长的音频,容易导致爆显存。自动音频切片功能则是设定了一个最长音频切片时长,初次切片后,如果存在长于该时长的音频切片,将会被按照该时长二次强制切分,避免了爆显存的问题。 699 | > - 强制切片可能会导致音频从一个字的中间切开,分别推理再合并时可能会存在人声不连贯。你需要在高级设置中设置强制切片的交叉淡入长度来避免这一问题。 700 | 701 | **可选项部分:部分具体见下一节** 702 | 703 | - `-lg` | `--linear_gradient`:两段音频切片的交叉淡入长度,如果强制切片后出现人声不连贯可调整该数值,如果连贯建议采用默认值 0,单位为秒 704 | - `-f0p` | `--f0_predictor`:选择 F0 预测器,可选择 crepe,pm,dio,harvest,rmvpe,fcpe, 默认为 pm(注意:crepe 会对原始 F0 使用均值滤波器),不同 F0 预测器的优缺点请 **参考 [2.4.3 中的 F0 预测器的优缺点](#各个-f0-预测器的优缺点)** 705 | - `-a` | `--auto_predict_f0`:语音转换自动预测音高,转换歌声时不要打开,否则会严重跑调 706 | - `-cm` | `--cluster_model_path`:聚类模型或特征检索索引路径,留空则自动设为各方案模型的默认路径,如果没有训练聚类或特征检索则随便填 707 | - `-cr` | `--cluster_infer_ratio`:聚类方案或特征检索占比,范围 0-1,若没有训练聚类模型或特征检索则默认 0 即可 708 | - `-eh` | `--enhance`:是否使用 NSF_HIFIGAN 增强器,该选项对部分训练集少的模型有一定的音质增强效果,但是对训练好的模型有反面效果,默认关闭 709 | - `-shd` | `--shallow_diffusion`:是否使用浅层扩散,使用后可解决一部分电音问题,默认关闭,该选项打开时,NSF_HIFIGAN
增强器将会被禁止 710 | - `-usm` | `--use_spk_mix`:是否使用角色融合/动态声线融合 711 | - `-lea` | `--loudness_envelope_adjustment`:输入源响度包络替换输出响度包络融合比例,越靠近 1 越使用输出响度包络 712 | - `-fr` | `--feature_retrieval`:是否使用特征检索,如果使用聚类模型将被禁用,且 cm 与 cr 参数将会变成特征检索的索引路径与混合比例 713 | 714 | > [!NOTE] 715 | > 716 | > **聚类模型/特征检索混合比例** `-cr` | `--cluster_infer_ratio` 717 | > 718 | > - 该参数控制的是使用聚类模型/特征检索模型时线性参与的占比。聚类模型和特征检索均可以有限提升音色相似度,但带来的代价是会降低咬字准确度(特征检索的咬字比聚类稍好一些)。该参数的范围为 0-1, 0 为不启用,越靠近 1, 则音色越相似,咬字越模糊。 719 | > - 聚类模型和特征检索共用这一参数,当加载模型时使用了何种模型,则该参数控制何种模型的混合比例。 720 | > - **注意,当未加载聚类模型或特征检索模型时,请保持该参数为 0,否则会报错。** 721 | 722 | **浅扩散设置:** 723 | 724 | - `-dm` | `--diffusion_model_path`:扩散模型路径 725 | - `-dc` | `--diffusion_config_path`:扩散模型配置文件路径 726 | - `-ks` | `--k_step`:扩散步数,越大越接近扩散模型的结果,默认 100 727 | - `-od` | `--only_diffusion`:纯扩散模式,该模式不会加载 sovits 模型,以扩散模型推理 728 | - `-se` | `--second_encoding`:二次编码,浅扩散前会对原始音频进行二次编码,玄学选项,有时候效果好,有时候效果差 729 | 730 | > [!NOTE] 731 | > 732 | > **关于浅扩散步数** `-ks` | `--k_step` 733 | > 734 | > 完整的高斯扩散为 1000 步,当浅扩散步数达到 1000 步时,此时的输出结果完全是扩散模型的输出结果,So-VITS 模型将被抑制。浅扩散步数越高,越接近扩散模型输出的结果。**如果你只是想用浅扩散去除电音底噪,尽可能保留 So-VITS 模型的音色,浅扩散步数可以设定为 30-50** 735 | 736 | > [!WARNING] 737 | > 738 | > 如果使用 `whisper-ppg` 声音编码器进行推理,需要将 `--clip` 设置为 25,`-lg` 设置为 1。否则将无法正常推理。 739 | 740 | ## 3.2 webUI 推理 741 | 742 | 使用以下命令打开 webui 界面,**上传模型并且加载,按照说明按需填写推理参数,上传推理音频,开始推理。** 743 | 744 | 具体的推理参数详解和上面的 [3.1 命令行推理](#31-命令行推理) 一致,只不过搬到了交互式界面上去,并且附有简单的说明。 745 | 746 | ```bash 747 | python webUI.py 748 | ``` 749 | 750 | > [!WARNING] 751 | > 752 | > **请务必查看 [命令行推理](#31-命令行推理) 以了解具体参数的含义。并且请特别注意 NOTE 和 WARNING 中的提醒!** 753 | 754 | webUI 中还内置了 **文本转语音** 功能(命令行下的等效做法可参考本节末尾的示例): 755 | 756 | - 文本转语音使用微软的 edge_TTS 服务生成一段原始语音,再通过 So-VITS 将这段语音的声线转换为目标声线。So-VITS 只能实现歌声转换 (SVC) 功能,没有任何 **原生** 的文本转语音 (TTS) 功能!由于微软的 edge_TTS 生成的语音较为僵硬、没有感情,所以转换出来的音频自然也会比较僵硬。**如果你需要有感情的 TTS 功能,请移步 [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS) 项目。** 757 | - 目前文本转语音共支持 55 种语言,涵盖了大部分常见语言。程序会根据文本框内输入的文本自动识别语言并转换。 758 | - 自动识别只能识别到语种,而某些语种可能涵盖不同的口音、说话人,如果使用了自动识别,程序会从符合该语种和指定性别的说话人中随机挑选一个来转换。如果你的目标语种说话人口音比较多(例如英语),建议手动指定一个口音的说话人。如果手动指定了说话人,则先前手动选择的性别会被抑制。 759 |
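✨ 下面给出一个仅供参考的命令行等效示例:先用 edge-tts 的命令行生成一段原始语音,再按 [3.1 命令行推理](#31-命令行推理) 的方式转换声线,效果与 webUI 内置的文本转语音类似。示例假设你已按 [1.5 FFmpeg](#15-ffmpeg) 配置好 FFmpeg;其中的发音人名称、模型与配置文件路径均为假设值,请按实际情况修改。

```bash
# 仅供参考:用 edge-tts 生成一段原始语音(发音人名称仅为示例)
edge-tts --text "你好,这是一段测试语音。" --voice zh-CN-XiaoxiaoNeural --write-media raw/tts_raw.mp3
# 转换为 44100Hz 单声道 wav,并放在 raw 文件夹下
ffmpeg -i raw/tts_raw.mp3 -ar 44100 -ac 1 raw/tts_raw.wav
# 按 3.1 的方式推理;语音转换可开启自动 f0 预测(-a),模型路径请替换为你自己的模型
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "tts_raw.wav" -t 0 -s "speaker" -a
```

760 | # 4.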
增强效果的可选项 761 | 762 | ✨ 如果前面的效果已经满意,或者没看明白下面在讲啥,那后面的内容都可以忽略,不影响模型使用(这些可选项影响比较小,可能在某些特定数据上有点效果,但大部分情况似乎都感知不太明显) 763 | 764 | ## 4.1 自动 f0 预测 765 | 766 | 模型训练过程会训练一个 f0 预测器,是一个自动变调的功能,可以将模型音高匹配到推理源音高,用于说话声音转换时可以打开,能够更好匹配音调。**但转换歌声时请不要启用此功能!!!会严重跑调!!** 767 | 768 | - 命令行推理:在 `inference_main` 中设置 `auto_predict_f0` 为 `true` 即可 769 | - webUI 推理:勾选相应选项即可 770 | 771 | ## 4.2 聚类音色泄漏控制 772 | 773 | 聚类方案可以减小音色泄漏,使得模型训练出来更像目标的音色(但其实不是特别明显),但是单纯的聚类方案会降低模型的咬字(会口齿不清)(这个很明显),本模型采用了融合的方式,可以线性控制聚类方案与非聚类方案的占比,也就是可以手动在"像目标音色" 和 "咬字清晰" 之间调整比例,找到合适的折中点。 774 | 775 | 使用聚类前面的已有步骤不用进行任何的变动,只需要额外训练一个聚类模型,虽然效果比较有限,但训练成本也比较低。 776 | 777 | - 训练方法: 778 | 779 | ```bash 780 | # 使用CPU训练: 781 | python cluster/train_cluster.py 782 | # 或者使用GPU训练: 783 | python cluster/train_cluster.py --gpu 784 | ``` 785 | 786 | 训练结束后,模型的输出会在 `logs/44k/kmeans_10000.pt` 787 | 788 | - 命令行推理过程: 789 | - `inference_main.py` 中指定 `cluster_model_path` 790 | - `inference_main.py` 中指定 `cluster_infer_ratio`,`0`为完全不使用聚类,`1`为只使用聚类,通常设置`0.5`即可 791 | - webUI 推理过程: 792 | - 上传并加载聚类模型 793 | - 设置聚类模型/特征检索混合比例,0-1 之间,0 即不启用聚类/特征检索。使用聚类/特征检索能提升音色相似度,但会导致咬字下降(如果使用建议 0.5 左右) 794 | 795 | ## 4.3 特征检索 796 | 797 | 跟聚类方案一样可以减小音色泄漏,咬字比聚类稍好,但会降低推理速度,采用了融合的方式,可以线性控制特征检索与非特征检索的占比。 798 | 799 | - 训练过程:需要在生成 hubert 与 f0 后执行: 800 | 801 | ```bash 802 | python train_index.py -c configs/config.json 803 | ``` 804 | 805 | 训练结束后模型的输出会在 `logs/44k/feature_and_index.pkl` 806 | 807 | - 命令行推理过程: 808 | - 需要首先制定 `--feature_retrieval`,此时聚类方案会自动切换到特征检索方案 809 | - `inference_main.py` 中指定 `cluster_model_path` 为模型输出文件 810 | - `inference_main.py` 中指定 `cluster_infer_ratio`,`0`为完全不使用特征检索,`1`为只使用特征检索,通常设置`0.5`即可 811 | - webUI 推理过程: 812 | - 上传并加载聚类模型 813 | - 设置聚类模型/特征检索混合比例,0-1 之间,0 即不启用聚类/特征检索。使用聚类/特征检索能提升音色相似度,但会导致咬字下降(如果使用建议 0.5 左右) 814 | 815 | ## 4.4 声码器微调 816 | 817 | 在 So-VITS 中使用扩散模型时,经过扩散模型增强的 Mel 谱图会经过声码器(Vocoder)输出为最终音频。声码器在其中对输出音频的音质起到了决定性的作用。So-VITS-SVC 目前使用的是 [NSF-HiFiGAN 社区声码器](https://openvpi.github.io/vocoders/),实际上,你也可以用你自己的数据集对该声码器模型进行微调训练,在 So-VITS 的 **扩散流程** 中使用微调后的声码器,使其更符合你的模型任务。 818 | 819 | [SingingVocoders](https://github.com/openvpi/SingingVocoders) 项目提供了对声码器的微调方法,在 Diffusion-SVC 项目中,**使用微调声码器可以使输出音质得到大幅增强**。你也可以自行使用自己的数据集训练一个微调声码器,并在本整合包中使用。 820 | 821 | 1. 使用 [SingingVocoders](https://github.com/openvpi/SingingVocoders) 训练一个微调声码器,并获得其模型和配置文件 822 | 2. 将模型和配置文件放置在 `pretrain/{微调声码器名称}/` 下 823 | 3. 修改对应模型的扩散模型配置文件`diffusion.yaml`中的如下内容: 824 | 825 | ```yaml 826 | vocoder: 827 | ckpt: pretrain/nsf_hifigan/model.ckpt # 此行是你的微调声码器的模型路径 828 | type: nsf-hifigan # 此行是微调声码器的类型,如果不知道请不要修改 829 | ``` 830 | 831 | 4. 
按照 [3.2 webui 推理](#32-webui-推理),上传扩散模型和**修改后的扩散模型配置文件**,即可使用微调声码器。 832 | 833 | > [!WARNING] 834 | > 835 | > **目前仅 NSF-HiFiGAN 声码器支持微调** 836 | 837 | ## 4.5 各模型保存的目录 838 | 839 | 截止上文,一共能够训练的 4 种模型都已经讲完了,下表总结为这四种模型和配置文件的总结。 840 | 841 | webUI 中除了能够上传模型进行加载以外,也可以读取本地模型文件。你只需将下表这些模型先放入到一个文件夹内,再将该文件夹放到 trained 文件夹内,点击“刷新本地模型列表”,即可被 webUI 识别到。然后手动选择需要加载的模型进行加载即可。**注意**:本地模型自动加载可能无法正常加载下表中的(可选)模型。 842 | 843 | | 文件 | 文件名及后缀 | 存放位置 | 844 | | ------------------------ | ----------------------- | -------------------- | 845 | | So-VITS 模型 | `G_xxxx.pth` | `logs/44k` | 846 | | So-VITS 模型配置文件 | `config.json` | `configs` | 847 | | 扩散模型(可选) | `model_xxxx.pt` | `logs/44k/diffusion` | 848 | | 扩散模型配置文件(可选) | `diffusion.yaml` | `configs` | 849 | | Kmeans 聚类模型(可选) | `kmeans_10000.pt` | `logs/44k` | 850 | | 特征索引模型(可选) | `feature_and_index.pkl` | `logs/44k` | 851 | 852 | # 5.其他可选功能 853 | 854 | ✨ 此部分相较于前面的其他部分,重要性更低。除了 [5.1 模型压缩](#32-webui-推理) 是一个较为方便的功能以外,其余可选功能用到的概率较低,故此处仅参考官方文档并加以简单描述。 855 | 856 | ## 5.1 模型压缩 857 | 858 | 生成的模型含有继续训练所需的信息。如果**确认不再训练**,可以移除模型中此部分信息,得到约 1/3 大小的最终模型。 859 | 860 | 使用 compress_model.py 861 | 862 | ```bash 863 | # 例如,我想压缩一个在logs/44k/目录下名字为G_30400.pth的模型,并且配置文件为configs/config.json,可以运行如下命令 864 | python compress_model.py -c="configs/config.json" -i="logs/44k/G_30400.pth" -o="logs/44k/release.pth" 865 | # 压缩后的模型保存在logs/44k/release.pth 866 | ``` 867 | 868 | > [!WARNING] 869 | > 870 | > **注意:压缩后的模型无法继续训练!** 871 | 872 | ## 5.2 声线混合 873 | 874 | ### 5.2.1 静态声线混合 875 | 876 | **参考 `webUI.py` 文件中,小工具/实验室特性的静态声线融合。** 877 | 878 | 该功能可以将多个声音模型合成为一个声音模型(多个模型参数的凸组合或线性组合),从而制造出现实中不存在的声线。 879 | 880 | **注意:** 881 | 882 | 1. 该功能仅支持单说话人的模型 883 | 2. 如果强行使用多说话人模型,需要保证多个模型的说话人数量相同,这样可以混合同一个 SpaekerID 下的声音 884 | 3. 保证所有待混合模型的 config.json 中的 model 字段是相同的 885 | 4. 输出的混合模型可以使用待合成模型的任意一个 config.json,但聚类模型将不能使用 886 | 5. 批量上传模型的时候最好把模型放到一个文件夹选中后一起上传 887 | 6. 混合比例调整建议大小在 0-100 之间,也可以调为其他数字,但在线性组合模式下会出现未知的效果 888 | 7. 混合完毕后,文件将会保存在项目根目录中,文件名为 output.pth 889 | 8. 凸组合模式会将混合比例执行 Softmax 使混合比例相加为 1,而线性组合模式不会 890 | 891 | ### 5.2.2 动态声线混合 892 | 893 | **参考 `spkmix.py` 文件中关于动态声线混合的介绍** 894 | 895 | 角色混合轨道编写规则: 896 | 897 | - 角色 ID : \[\[起始时间 1, 终止时间 1, 起始数值 1, 起始数值 1], [起始时间 2, 终止时间 2, 起始数值 2, 起始数值 2]] 898 | - 起始时间和前一个的终止时间必须相同,第一个起始时间必须为 0,最后一个终止时间必须为 1 (时间的范围为 0-1) 899 | - 全部角色必须填写,不使用的角色填\[\[0., 1., 0., 0.]]即可 900 | - 融合数值可以随便填,在指定的时间段内从起始数值线性变化为终止数值,内部会自动确保线性组合为 1(凸组合条件),可以放心使用 901 | 902 | 命令行推理的时候使用 `--use_spk_mix` 参数即可启用动态声线混合。webUI 推理时勾选“动态声线融合”选项框即可。 903 | 904 | ## 5.3 Onnx 导出 905 | 906 | 使用 onnx_export.py。目前 onnx 模型只有 [MoeVoiceStudio](https://github.com/NaruseMioShirakana/MoeVoiceStudio) 需要使用到。更详细的操作和使用方法请移步 [MoeVoiceStudio](https://github.com/NaruseMioShirakana/MoeVoiceStudio) 仓库说明。 907 | 908 | - 新建文件夹:`checkpoints` 并打开 909 | - 在`checkpoints`文件夹中新建一个文件夹作为项目文件夹,文件夹名为你的项目名称,比如`aziplayer` 910 | - 将你的模型更名为`model.pth`,配置文件更名为`config.json`,并放置到刚才创建的`aziplayer`文件夹下 911 | - 将 onnx_export.py 中`path = "NyaruTaffy"` 的 `"NyaruTaffy"` 修改为你的项目名称,`path = "aziplayer" (onnx_export_speaker_mix,为支持角色混合的onnx导出)` 912 | - 运行 `python onnx_export.py` 913 | - 等待执行完毕,在你的项目文件夹下会生成一个`model.onnx`,即为导出的模型 914 | 915 | 注意:Hubert Onnx 模型请使用 [MoeVoiceStudio](https://github.com/NaruseMioShirakana/MoeVoiceStudio) 提供的模型,目前无法自行导出(fairseq 中 Hubert 有不少 onnx 不支持的算子和涉及到常量的东西,在导出时会报错或者导出的模型输入输出 shape 和结果都有问题) 916 | 917 | # 6. 
简单混音处理及成品导出 918 | 919 | 使用音频宿主软件处理推理后音频,请参考[配套视频教程](https://www.bilibili.com/video/BV1Hr4y197Cy/)或其他更专业的混音教程。 920 | 921 | # 附录:常见报错的解决办法 922 | 923 | ✨ **部分报错及解决方法,来自 [羽毛布団](https://space.bilibili.com/3493141443250876) 大佬的 [相关专栏](https://www.bilibili.com/read/cv22206231) | [相关文档](https://www.yuque.com/umoubuton/ueupp5/ieinf8qmpzswpsvr)** 924 | 925 | ## 关于爆显存 926 | 927 | 如果你在终端或 WebUI 界面的报错中出现了这样的报错: 928 | 929 | ```bash 930 | OutOfMemoryError: CUDA out of memory. Tried to allocate XX GiB (GPU 0: XX GiB total capacity; XX GiB already allocated; XX MiB Free: XX GiB reserved in total by PyTorch) 931 | ``` 932 | 933 | 不要怀疑,你的显卡显存或虚拟内存不够用了。以下是能 100% 解决问题的方法,照着做必能解决。请不要再在各种地方提问这个问题了 934 | 935 | 1. 在报错中找到 XX GiB already allocated 之后,是否显示 0 bytes free,如果是 0 bytes free 那么看第 2,3,4 步,如果显示 XX MiB free 或者 XX GiB free,看第 5 步 936 | 2. 如果是预处理的时候爆显存: 937 | - 换用对显存占用友好的 f0 预测器 (友好度从高到低: pm >= harvest >= rmvpe ≈ fcpe >> crepe),建议首选 rmvpe 或 fcpe 938 | - 多进程预处理改为 1 939 | 3. 如果是训练的时候爆显存: 940 | - 检查数据集有没有过长的切片(20 秒以上) 941 | - 调小批量大小 (batch size) 942 | - 更换一个占用低的项目 943 | - 去 AutoDL 等云算力平台上面租一张大显存的显卡跑 944 | 4. 如果是推理的时候爆显存: 945 | - 推理源 (干声) 不干净 (有残留的混响,伴奏,和声),导致自动切片切不开。提取干声最佳实践请参考 [UVR5 歌曲人声分离教程](https://www.bilibili.com/video/BV1F4421c7qU/) 946 | - 调大切片阈值 (比如-40 调成-30,再大就不建议了,你也不想唱一半就被切一刀吧) 947 | - 设置强制切片,从 60 秒开始尝试,每次减小 10 秒,直到能成功推理 948 | - 使用 cpu 推理,速度会很慢但是不会爆显存 949 | 5. 如果显示仍然有空余显存却还是爆显存了,是你的虚拟内存不够大,调整到至少 50G 以上 950 | 951 | ## 安装依赖时出现的相关报错 952 | 953 | **1. 安装 CUDA=11.7 的 Pytorch 时报错** 954 | 955 | ``` 956 | ERROR: Package 'networkx' requires a different Python: 3.8.9 not in '>=3.9 957 | ``` 958 | 959 | 解决方法有两种: 960 | 961 | - 升级 python 至 3.10,其余操作不变。 962 | - (建议)保持 python 版本不变,先执行 `pip install networkx==3.0`,之后再进行 Pytorch 的安装。 963 | 964 | **2. 依赖找不到导致的无法安装** 965 | 966 | 出现**类似**以下报错时: 967 | 968 | ```bash 969 | ERROR: Could not find a version that satisfies the requirement librosa==0.9.1 (from versions: none) 970 | ERROR: No matching distribution found for librosa==0.9.1 971 | # 报错的主要特征是 972 | No matching distribution found for xxxxx 973 | Could not find a version that satisfies the requirement xxxx 974 | ``` 975 | 976 | 具体解决方法为:更换安装源。手动安装这一依赖时添加下载源,以下是两个常用的镜像源地址 977 | 978 | - 清华大学:https://pypi.tuna.tsinghua.edu.cn/simple 979 | - 阿里云:http://mirrors.aliyun.com/pypi/simple 980 | 981 | 使用 `pip install [包名称] -i [下载源地址]`,例如我想在阿里源下载 librosa 这个依赖,并且要求依赖版本是 0.9.1,那么应该在 cmd 中输入以下命令: 982 | 983 | ```bash 984 | pip install librosa==0.9.1 -i http://mirrors.aliyun.com/pypi/simple 985 | ``` 986 | 987 | **3. pip 版本过高导致部分依赖无法安装** 988 | 989 | 2024 年 6 月 21 日,pip 更新到了 24.1 版本,仅仅使用`pip install --upgrade pip`会使得 pip 的版本更新到 24.1,而部分依赖需要使用 pip 23.0 才能安装,因此需要手动降级 pip 版本。目前已知:hydra-core,omegaconf,fastapi 会受到此影响,具体表现为安装时出现下面的错误: 990 | 991 | ```bash 992 | Please use pip<24.1 if you need to use this version. 993 | INFO: pip is looking at multiple versions of hydra-core to determine which version is compatible with other requirements. This could take a while. 994 | ERROR: Cannot install -r requirements.txt (line 20) and fairseq because these package versions have conflicting dependencies. 995 | 996 | The conflict is caused by: 997 | fairseq 0.12.2 depends on omegaconf<2.1 998 | hydra-core 1.0.7 depends on omegaconf<2.1 and >=2.0.5 999 | 1000 | To fix this you could try to: 1001 | 1. loosen the range of package versions you've specified 1002 | 2.
remove package versions to allow pip to attempt to solve the dependency conflict 1003 | ``` 1004 | 1005 | 解决方法为:在 [1.4 其他依赖项安装](#14-其他依赖项安装) 时,先限制 pip 版本再进行依赖的安装,使用下面的命令限制 1006 | 1007 | ```bash 1008 | pip install --upgrade pip==23.3.2 wheel setuptools 1009 | ``` 1010 | 1011 | 运行完成后,再进行其他依赖项的安装。 1012 | 1013 | ## 数据集预处理和模型训练时的相关报错 1014 | 1015 | **1. 报错:`UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position xx`** 1016 | 1017 | - 数据集文件名中不要包含中文或日文等非西文字符,特别注意**中文**括号,逗号,冒号,分号,引号等等都是不行的。改完名字**一定要**重新预处理,然后再进行训练!!! 1018 | 1019 | **2. 报错:`The expand size of the tensor (768) must match the existing size (256) at non-singleton dimension 0.`** 1020 | 1021 | - 把 dataset/44k 下的内容全部删了,重新走一遍预处理流程 1022 | 1023 | **3. 报错:`RuntimeError: DataLoader worker (pid(s) 13920) exited unexpectedly`** 1024 | 1025 | ```bash 1026 | raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e 1027 | RuntimeError: DataLoader worker (pid(s) 13920) exited unexpectedly 1028 | ``` 1029 | 1030 | - 调小 batchsize 值,调大虚拟内存,重启电脑清理显存,直到 batchsize 值和虚拟内存合适不报错为止 1031 | 1032 | **4. 报错:`torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 3221225477`** 1033 | 1034 | - 调大虚拟内存,调小 batchsize 值,直到 batchsize 值和虚拟内存合适不报错为止 1035 | 1036 | **5. 报错:`AssertionError: CPU training is not allowed.`** 1037 | 1038 | - 没有解决方法:非 N 卡跑不了。(也不是完全跑不了,但如果你是纯萌新的话,那我的回答确实就是:跑不了) 1039 | 1040 | **6. 报错:`FileNotFoundError: No such file or directory: 'pretrain/rmvpe.pt'`** 1041 | 1042 | - 运行 `python preprocess_hubert_f0.py --f0_predictor rmvpe --use_diff` 后出现 `FileNotFoundError: No such file or directory: 'pretrain/rmvpe.pt'` 1043 | - 因为官方更新了 rmvpe 预处理器来处理 f0,请参考教程文档 [#2.2.3](#223-可选项-根据情况选择) 下载预处理模型`rmvpe.pt`并放到对应位置。 1044 | 1045 | **7. 报错:页面文件太小,无法完成操作。** 1046 | 1047 | - 调大虚拟内存,具体的方法各种地方一搜就能搜到,不展开了 1048 | 1049 | ## 使用 WebUI 时相关报错 1050 | 1051 | **1. webUI 启动或加载模型时**: 1052 | 1053 | - 启动 webUI 时报错:`ImportError: cannot import name 'Schema' from 'pydantic'` 1054 | - webUI 加载模型时报错:`AttributeError("'Dropdown' object has no attribute 'update'")` 1055 | - **凡是报错中涉及到 fastapi, gradio, pydantic 这三个依赖的报错** 1056 | 1057 | **解决方法**: 1058 | 1059 | - 需限制部分依赖版本,在安装完 `requirements_win.txt` 后,在 cmd 中依次输入以下命令以更新依赖包: 1060 | 1061 | ```bash 1062 | pip install --upgrade fastapi==0.84.0 1063 | pip install --upgrade gradio==3.41.2 1064 | pip install --upgrade pydantic==1.10.12 1065 | ``` 1066 | 1067 | **2. 报错:`Given groups=1, weight of size [xxx, 256, xxx], expected input[xxx, 768, xxx] to have 256 channels, but got 768 channels instead`** 1068 | 或 **报错: 配置文件中的编码器与模型维度不匹配** 1069 | 1070 | - 原因:v1 分支的模型用了 vec768 的配置文件;如果上面报错中的 256 和 768 位置反过来了,那就是 vec768 的模型用了 v1 的配置文件。 1071 | - 解决方法:检查配置文件中的 `ssl_dim` 一项,如果这项是 256,那你的 `speech_encoder` 应当修改为 `vec256l9`,如果是 768,则是 `vec768l12` 1072 | - 详细修改方法请参考 [#2.1](#21-关于兼容-40-模型的问题) 1073 | 1074 | **3.
报错:`'HParams' object has no attribute 'xxx'`** 1075 | 1076 | - 无法找到音色,一般是配置文件和模型没对应,打开配置文件拉到最下面看看有没有你训练的音色 1077 | 1078 | # 感谢名单 1079 | 1080 | - so-vits-svc | [so-vits-svc](https://github.com/svc-develop-team/so-vits-svc) 1081 | - GPT-SoVITS | [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS) 1082 | - SingingVocoders | [SingingVocoders](https://github.com/openvpi/SingingVocoders) 1083 | - MoeVoiceStudio | [MoeVoiceStudio](https://github.com/NaruseMioShirakana/MoeVoiceStudio) 1084 | - [OpenVPI](https://github.com/openvpi) | [vocoders](https://github.com/openvpi/vocoders) 1085 | - up 主 [inifnite_loop](https://space.bilibili.com/286311429) | [相关视频](https://www.bilibili.com/video/BV1Bd4y1W7BN) | [相关专栏](https://www.bilibili.com/read/cv21425662) 1086 | - up 主 [羽毛布団](https://space.bilibili.com/3493141443250876) | [一些报错的解决办法](https://www.bilibili.com/read/cv22206231) | [常见报错解决方法](https://www.yuque.com/umoubuton/ueupp5/ieinf8qmpzswpsvr) 1087 | - 所有提供训练音频样本的人员 1088 | - 您 1089 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |
2 | 3 | # SoftVC VITS Singing Voice Conversion Local Deployment Tutorial 4 | [![Open in Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/SUC-DriverOld/so-vits-svc-Deployment-Documents/blob/4.1/sovits4_for_colab.ipynb)
5 | English | [简体中文](README_zh_CN.md)
6 | This help document provides detailed installation, debugging, and inference tutorials for the project [so-vits-svc](https://github.com/svc-develop-team/so-vits-svc). You can also directly refer to the official [README](https://github.com/svc-develop-team/so-vits-svc#readme) documentation.
7 | Written by Sucial. [Bilibili](https://space.bilibili.com/445022409) | [Github](https://github.com/SUC-DriverOld) 8 | 9 |
10 | 11 | --- 12 | 13 | ✨ **Click to view: [Accompanying Video Tutorial](https://www.bilibili.com/video/BV1Hr4y197Cy/) | [UVR5 Vocal Separation Tutorial](https://www.bilibili.com/video/BV1F4421c7qU/) (Note: The accompanying video may be outdated. Refer to the latest tutorial documentation for accurate information!)** 14 | 15 | ✨ **Related Resources: [Official README Documentation](https://github.com/svc-develop-team/so-vits-svc) | [Common Error Solutions](https://www.yuque.com/umoubuton/ueupp5/ieinf8qmpzswpsvr) | [羽毛布団](https://space.bilibili.com/3493141443250876)** 16 | 17 | > [!IMPORTANT] 18 | > 19 | > 1. **Important! Read this first!** If you do not want to configure the environment manually or are looking for an integration package, please use the integration package by [羽毛布団](https://space.bilibili.com/3493141443250876). 20 | > 2. **About old version tutorials**: For the so-vits-svc3.0 version tutorial, please switch to the [3.0 branch](https://github.com/SUC-DriverOld/so-vits-svc-Chinese-Detaild-Documents/tree/3.0). This branch is no longer being updated! 21 | > 3. **Continuous improvement of the documentation**: If you encounter errors not mentioned in this document, you can ask questions in the issues section. For project bugs, please report issues to the original project. If you want to improve this tutorial, feel free to submit a PR! 22 | 23 | # Tutorial Index 24 | 25 | - [SoftVC VITS Singing Voice Conversion Local Deployment Tutorial](#softvc-vits-singing-voice-conversion-local-deployment-tutorial) 26 | - [Tutorial Index](#tutorial-index) 27 | - [0. Before You Use](#0-before-you-use) 28 | - [Any country, region, organization, or individual using this project must comply with the following laws:](#any-country-region-organization-or-individual-using-this-project-must-comply-with-the-following-laws) 29 | - [《民法典》](#民法典) 30 | - [第一千零一十九条](#第一千零一十九条) 31 | - [第一千零二十四条](#第一千零二十四条) 32 | - [第一千零二十七条](#第一千零二十七条) 33 | - [《中华人民共和国宪法》|《中华人民共和国刑法》|《中华人民共和国民法典》|《中华人民共和国合同法》](#中华人民共和国宪法中华人民共和国刑法中华人民共和国民法典中华人民共和国合同法) 34 | - [0.1 Usage Regulations](#01-usage-regulations) 35 | - [0.2 Hardware Requirements](#02-hardware-requirements) 36 | - [0.3 Preparation](#03-preparation) 37 | - [1. Environment Dependencies](#1-environment-dependencies) 38 | - [1.1 so-vits-svc4.1 Source Code](#11-so-vits-svc41-source-code) 39 | - [1.2 Python](#12-python) 40 | - [1.3 Pytorch](#13-pytorch) 41 | - [1.4 Installation of Other Dependencies](#14-installation-of-other-dependencies) 42 | - [1.5 FFmpeg](#15-ffmpeg) 43 | - [2. 
Configuration and Training](#2-configuration-and-training) 44 | - [2.1 Issues Regarding Compatibility with the 4.0 Model](#21-issues-regarding-compatibility-with-the-40-model) 45 | - [2.2 Pre-downloaded Model Files](#22-pre-downloaded-model-files) 46 | - [2.2.1 Mandatory Items](#221-mandatory-items) 47 | - [Detailed Explanation of Each Encoder](#detailed-explanation-of-each-encoder) 48 | - [2.2.2 Pre-trained Base Model (Strongly Recommended)](#222-pre-trained-base-model-strongly-recommended) 49 | - [2.2.3 Optional Items (Choose as Needed)](#223-optional-items-choose-as-needed) 50 | - [2.3 Data Preparation](#23-data-preparation) 51 | - [2.4 Data Preprocessing](#24-data-preprocessing) 52 | - [2.4.0 Audio Slicing](#240-audio-slicing) 53 | - [2.4.1 Resampling to 44100Hz Mono](#241-resampling-to-44100hz-mono) 54 | - [2.4.2 Automatic Dataset Splitting and Configuration File Generation](#242-automatic-dataset-splitting-and-configuration-file-generation) 55 | - [Using Loudness Embedding](#using-loudness-embedding) 56 | - [2.4.3 Modify Configuration Files as Needed](#243-modify-configuration-files-as-needed) 57 | - [config.json](#configjson) 58 | - [diffusion.yaml](#diffusionyaml) 59 | - [2.4.3 Generating Hubert and F0](#243-generating-hubert-and-f0) 60 | - [Pros and Cons of Each F0 Predictor](#pros-and-cons-of-each-f0-predictor) 61 | - [2.5 Training](#25-training) 62 | - [2.5.1 Main Model Training (Required)](#251-main-model-training-required) 63 | - [2.5.2 Diffusion Model (Optional)](#252-diffusion-model-optional) 64 | - [2.5.3 Tensorboard](#253-tensorboard) 65 | - [3. Inference](#3-inference) 66 | - [3.1 Command-line Inference](#31-command-line-inference) 67 | - [3.2 webUI Inference](#32-webui-inference) 68 | - [4. Optional Enhancements](#4-optional-enhancements) 69 | - [4.1 Automatic F0 Prediction](#41-automatic-f0-prediction) 70 | - [4.2 Clustering Timbre Leakage Control](#42-clustering-timbre-leakage-control) 71 | - [4.3 Feature Retrieval](#43-feature-retrieval) 72 | - [4.4 Vocoder Fine-tuning](#44-vocoder-fine-tuning) 73 | - [4.5 Directories for Saved Models](#45-directories-for-saved-models) 74 | - [5. Other Optional Features](#5-other-optional-features) 75 | - [5.1 Model Compression](#51-model-compression) 76 | - [5.2 Voice Mixing](#52-voice-mixing) 77 | - [5.2.1 Static Voice Mixing](#521-static-voice-mixing) 78 | - [5.2.2 Dynamic Voice Mixing](#522-dynamic-voice-mixing) 79 | - [5.3 Onnx Export](#53-onnx-export) 80 | - [6. Simple Mixing and Exporting Finished Product](#6-simple-mixing-and-exporting-finished-product) 81 | - [Use Audio Host Software to Process Inferred Audio](#use-audio-host-software-to-process-inferred-audio) 82 | - [Appendix: Common Errors and Solutions](#appendix-common-errors-and-solutions) 83 | - [About Out of Memory (OOM)](#about-out-of-memory-oom) 84 | - [Common Errors and Solutions When Installing Dependencies](#common-errors-and-solutions-when-installing-dependencies) 85 | - [Common Errors During Dataset Preprocessing and Model Training](#common-errors-during-dataset-preprocessing-and-model-training) 86 | - [Errors When Using WebUI\*\*](#errors-when-using-webui) 87 | - [Acknowledgements](#acknowledgements) 88 | 89 | # 0. 
Before You Use 90 | 91 | ### Any country, region, organization, or individual using this project must comply with the following laws: 92 | 93 | #### 《[民法典](http://gongbao.court.gov.cn/Details/51eb6750b8361f79be8f90d09bc202.html)》 94 | 95 | #### 第一千零一十九条 96 | 97 | 任何组织或者个人**不得**以丑化、污损,或者利用信息技术手段伪造等方式侵害他人的肖像权。**未经**肖像权人同意,**不得**制作、使用、公开肖像权人的肖像,但是法律另有规定的除外。**未经**肖像权人同意,肖像作品权利人不得以发表、复制、发行、出租、展览等方式使用或者公开肖像权人的肖像。对自然人声音的保护,参照适用肖像权保护的有关规定。 98 | **对自然人声音的保护,参照适用肖像权保护的有关规定** 99 | 100 | #### 第一千零二十四条 101 | 102 | 【名誉权】民事主体享有名誉权。任何组织或者个人**不得**以侮辱、诽谤等方式侵害他人的名誉权。 103 | 104 | #### 第一千零二十七条 105 | 106 | 【作品侵害名誉权】行为人发表的文学、艺术作品以真人真事或者特定人为描述对象,含有侮辱、诽谤内容,侵害他人名誉权的,受害人有权依法请求该行为人承担民事责任。行为人发表的文学、艺术作品不以特定人为描述对象,仅其中的情节与该特定人的情况相似的,不承担民事责任。 107 | 108 | #### 《[中华人民共和国宪法](http://www.gov.cn/guoqing/2018-03/22/content_5276318.htm)》|《[中华人民共和国刑法](http://gongbao.court.gov.cn/Details/f8e30d0689b23f57bfc782d21035c3.html?sw=中华人民共和国刑法)》|《[中华人民共和国民法典](http://gongbao.court.gov.cn/Details/51eb6750b8361f79be8f90d09bc202.html)》|《[中华人民共和国合同法](http://www.npc.gov.cn/zgrdw/npc/lfzt/rlyw/2016-07/01/content_1992739.htm)》 109 | 110 | ## 0.1 Usage Regulations 111 | 112 | > [!WARNING] 113 | > 114 | > 1. **This tutorial is for communication and learning purposes only. Do not use it for illegal activities, violations of public order, or other unethical purposes. Out of respect for the providers of audio sources, do not use this for inappropriate purposes.** 115 | > 2. **Continuing to use this tutorial implies agreement with the related regulations described herein. This tutorial fulfills its obligation to provide guidance and is not responsible for any subsequent issues that may arise.** 116 | > 3. **Please resolve dataset authorization issues yourself. Do not use unauthorized datasets for training! Any issues arising from the use of unauthorized datasets are your own responsibility and have no connection to the repository, the repository maintainers, the svc develop team, or the tutorial authors.** 117 | 118 | Specific usage regulations are as follows: 119 | 120 | - The content of this tutorial represents personal views only and does not represent the views of the so-vits-svc team or the original authors. 121 | - This tutorial assumes the use of the repository maintained by the so-vits-svc team. Please comply with the open-source licenses of any open-source code involved. 122 | - Any videos based on sovits made and posted on video platforms must clearly indicate in the description the source of the input vocals or audio used for the voice converter. For example, if using someone else's video or audio as the input source after vocal separation, a clear link to the original video or music must be provided. If using your own voice or audio synthesized by other vocal synthesis engines, this must also be indicated in the description. 123 | - Ensure the data sources used to create datasets are legal and compliant, and that data providers are aware of what you are creating and the potential consequences. You are solely responsible for any infringement issues arising from the input sources. When using other commercial vocal synthesis software as input sources, ensure you comply with the software's usage terms. Note that many vocal synthesis engine usage terms explicitly prohibit using them as input sources for conversion! 124 | - Cloud training and inference may involve financial costs. If you are a minor, please obtain permission and understanding from your guardian before proceeding. 
This tutorial is not responsible for any subsequent issues arising from unauthorized use.
- Local training (especially on less powerful hardware) may require prolonged high-load operation of the device. Ensure proper maintenance and cooling for your device.
- Because of the equipment available to the author, this tutorial has only been tested on Windows. If you use macOS or Linux, be prepared to solve problems on your own.
- Continuing to use this repository implies agreement with the related regulations described in the README. This README fulfills its obligation to provide guidance and is not responsible for any subsequent issues that may arise.

## 0.2 Hardware Requirements

1. Training **must** be conducted on a GPU! Inference can be done via **command-line inference** or **WebUI inference**, and either a CPU or a GPU can be used if speed is not a primary concern.
2. If you plan to train your own model, prepare an **NVIDIA graphics card with at least 6GB of dedicated memory**.
3. Set your computer's virtual memory to **at least 30GB**, preferably on an SSD; otherwise preprocessing and training will be very slow.
4. For cloud training, [Google Colab](https://colab.google/) is recommended; you can configure it according to the provided [sovits4_for_colab.ipynb](./sovits4_for_colab.ipynb).

## 0.3 Preparation

1. Prepare at least 30 minutes (the more, the better!) of **clean vocals** as your training set, with **no background noise and no reverb**. It is best to maintain a **consistent timbre** while singing, ensure a **wide vocal range (the vocal range of the training set determines the range of the trained model!)**, and have an **appropriate loudness**. If possible, perform **loudness matching** using audio processing software such as Audition.
2. **Important!** Download the necessary **base model** for training in advance. Refer to [2.2.2 Pre-trained Base Model (Strongly Recommended)](#222-pre-trained-base-model-strongly-recommended).
3. For inference: Prepare **dry vocals** with **background noise <30dB** and preferably **without reverb or harmonies**.

> [!NOTE]
>
> **Note 1**: Both singing and speaking can be used as training sets, but using speech may lead to **issues with high and low notes during inference (commonly known as range issues/muted sound)**, as the vocal range of the training set largely determines the vocal range of the trained model. Therefore, if your final goal is singing, it is recommended to use singing vocals as your training set.
>
> **Note 2**: When using a male voice model to infer songs sung by female singers, if there is noticeable muting, try lowering the pitch (usually by 12 semitones, i.e., one octave). Similarly, when using a female voice model to infer songs sung by male singers, you can try raising the pitch.
>
> **✨ Latest Recommendation as of 2024.3.8 ✨**: Compared with so-vits-svc's TTS, the TTS (text-to-speech) of the [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS) project currently requires a smaller training set, trains faster, and yields better results. Therefore, if you want to use speech synthesis, please switch to [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS); for this project, it is recommended to use singing vocals as the training set.

# 1. Environment Dependencies

✨ **Required environment for this project**: [NVIDIA-CUDA](https://developer.nvidia.com/cuda-toolkit) | [Python](https://www.python.org/) = 3.8.9 (this version is recommended) | [Pytorch](https://pytorch.org/get-started/locally/) | [FFmpeg](https://ffmpeg.org/)

✨ **You can also try using my script for one-click environment setup and webUI launch: [so-vits-svc-webUI-QuickStart-bat](https://github.com/SUC-DriverOld/so-vits-svc-webUI-QuickStart-bat)**

## 1.1 so-vits-svc4.1 Source Code

You can download or clone the source code using one of the following methods:

1. **Download the source code ZIP file from the Github project page**: Go to the [so-vits-svc official repository](https://github.com/svc-develop-team/so-vits-svc), click the green `Code` button at the top right, and select `Download ZIP` to download the compressed file. If you need the code from another branch, switch to that branch first. After downloading, extract the ZIP file to any directory, which will serve as your working directory.

2. **Clone the source code using git**: Use the following command:

```bash
git clone https://github.com/svc-develop-team/so-vits-svc.git
```

## 1.2 Python

- Go to the [Python official website](https://www.python.org/) to download Python 3.8 and **add it to the system PATH**. Detailed instructions for installing Python and adding it to PATH are omitted here, as they can easily be found online.

```bash
# Conda configuration method, replace YOUR_ENV_NAME with the name of the virtual environment you want to create.
conda create -n YOUR_ENV_NAME python=3.8 -y
conda activate YOUR_ENV_NAME
# Ensure you are in this virtual environment before executing any commands!
```

- After installation, enter `python` in the command prompt. If the output is similar to the following, the installation was successful:

```bash
Python 3.8.9 (tags/v3.8.9:9d38120, Mar 23 2022, 23:13:41) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>
```

**Regarding the Python version**: After testing, we found that Python 3.8.9 can stably run this project (though higher versions may also work).

## 1.3 Pytorch

> [!IMPORTANT]
>
> ✨ We highly recommend installing a Pytorch build for CUDA 11.7 or 11.8. Builds for CUDA 12.0 and above may not be compatible with the current project.

- The `torch`, `torchaudio`, and `torchvision` libraries need to be **installed separately**. Go directly to the [Pytorch official website](https://pytorch.org/get-started/locally/), choose the desired version, and copy the command displayed in the "Run this Command" section into the console to install them. You can download older versions of Pytorch from [here](https://pytorch.org/get-started/previous-versions/).

- After installing `torch`, `torchaudio`, and `torchvision`, use the following commands in the cmd console to check whether torch can successfully call CUDA. If the last line shows `True`, it was successful; if it shows `False`, it was unsuccessful and you need to reinstall the correct version.

```bash
python
# Press Enter to run
import torch
# Press Enter to run
print(torch.cuda.is_available())
# Press Enter to run
```

> [!NOTE]
>
> 1. If you need to specify the version of `torch` manually, simply add the version number afterward. For example, `pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu117`.
> 2. When installing Pytorch for CUDA 11.7, you may encounter the error `ERROR: Package 'networkx' requires a different Python: 3.8.9 not in '>=3.9'`. In this case, first execute `pip install networkx==3.0`, and then proceed with the Pytorch installation to avoid similar errors.
> 3. Due to version updates, the official website may no longer show a copyable installation command for the CUDA 11.7 builds of Pytorch. In that case, you can directly copy the installation command below. Alternatively, you can download older versions from [here](https://pytorch.org/get-started/previous-versions/).

```bash
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu117
```

## 1.4 Installation of Other Dependencies

> [!IMPORTANT]
> ✨ Before installing the other dependencies, **make sure to download and install** [Visual Studio 2022](https://visualstudio.microsoft.com/) or the [Microsoft C++ Build Tools](https://visualstudio.microsoft.com/zh-hans/visual-cpp-build-tools/) (the latter is smaller). **Select and install the component package "Desktop development with C++"**, then apply the modification and wait for the installation to complete.

- Right-click in the folder obtained in [1.1](#11-so-vits-svc41-source-code) and select **Open in Terminal**. Use the following command to first update `pip`, `wheel`, and `setuptools`:

```bash
pip install --upgrade pip==23.3.2 wheel setuptools
```

- Execute the following command to install the required libraries (**if errors occur, rerun it until it finishes without errors and all dependencies are installed**). Note that there are three `requirements` txt files in the project folder; here, select `requirements_win.txt`.

```bash
pip install -r requirements_win.txt
```

- After ensuring the installation is **correct and error-free**, use the following commands to update the `fastapi`, `gradio`, and `pydantic` dependencies:

```bash
pip install --upgrade fastapi==0.84.0
pip install --upgrade pydantic==1.10.12
pip install --upgrade gradio==3.41.2
```

## 1.5 FFmpeg

- Go to the [FFmpeg official website](https://ffmpeg.org/) to download FFmpeg and unzip it to any location. Then add the path of its `.\ffmpeg\bin` directory to the PATH environment variable (detailed instructions for installation and editing PATH are omitted here, as they can easily be found online).

- After installation, enter `ffmpeg -version` in the cmd console. If the output is similar to the following, the installation was successful:

```bash
ffmpeg version git-2020-08-12-bb59bdb Copyright (c) 2000-2020 the FFmpeg developers
built with gcc 10.2.1 (GCC) 20200805
configuration: a bunch of configuration details here
libavutil 56. 58.100 / 56. 58.100
libavcodec 58.100.100 / 58.100.100
...
```

# 2. Configuration and Training

✨ This section is the most crucial part of the entire tutorial document. It references the [official documentation](https://github.com/svc-develop-team/so-vits-svc#readme) and includes some additional explanations and clarifications for better understanding.
261 | 262 | ✨ Before diving into the content of the second section, please ensure that your computer's virtual memory is set to **30GB or above**, preferably on a solid-state drive (SSD). You can search online for specific instructions on how to do this. 263 | 264 | ## 2.1 Issues Regarding Compatibility with the 4.0 Model 265 | 266 | - You can ensure support for the 4.0 model by modifying the `config.json` of the 4.0 model. You need to add the `speech_encoder` field under the `model` section in the `config.json`, as shown below: 267 | 268 | ```bash 269 | "model": 270 | { 271 | # Other contents omitted 272 | 273 | # "ssl_dim", fill in either 256 or 768, which should match the value below "speech_encoder" 274 | "ssl_dim": 256, 275 | # Number of speakers 276 | "n_speakers": 200, 277 | # or "vec768l12", but please note that the value here should match "ssl_dim" above. That is, 256 corresponds to vec256l9, and 768 corresponds to vec768l12. 278 | "speech_encoder":"vec256l9" 279 | # If you're unsure whether your model is vec768l12 or vec256l9, you can confirm by checking the value of the "gin_channels" field. 280 | 281 | # Other contents omitted 282 | } 283 | ``` 284 | 285 | ## 2.2 Pre-downloaded Model Files 286 | 287 | ### 2.2.1 Mandatory Items 288 | 289 | > [!WARNING] 290 | > 291 | > **You must select one of the following encoders to use:** 292 | > 293 | > - "vec768l12" 294 | > - "vec256l9" 295 | > - "vec256l9-onnx" 296 | > - "vec256l12-onnx" 297 | > - "vec768l9-onnx" 298 | > - "vec768l12-onnx" 299 | > - "hubertsoft-onnx" 300 | > - "hubertsoft" 301 | > - "whisper-ppg" 302 | > - "cnhubertlarge" 303 | > - "dphubert" 304 | > - "whisper-ppg-large" 305 | > - "wavlmbase+" 306 | 307 | | Encoder | Download Link | Location | Description | 308 | | ------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------- | ----------------------------------------------------------------------- | 309 | | contentvec (Recommended) | [checkpoint_best_legacy_500.pt](https://ibm.box.com/s/z1wgl1stco8ffooyatzdwsqn2psd9lrr) | Place in `pretrain` directory | `vec768l12` and `vec256l9` require this encoder | 310 | | | [hubert_base.pt](https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/hubert_base.pt) | Rename to checkpoint_best_legacy_500.pt, then place in `pretrain` directory | Same effect as the above `checkpoint_best_legacy_500.pt` but only 199MB | 311 | | hubertsoft | [hubert-soft-0d54a1f4.pt](https://github.com/bshall/hubert/releases/download/v0.1/hubert-soft-0d54a1f4.pt) | Place in `pretrain` directory | Used by so-vits-svc3.0 | 312 | | Whisper-ppg | [medium.pt](https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt) | Place in `pretrain` directory | Compatible with `whisper-ppg` | 313 | | whisper-ppg-large | [large-v2.pt](https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt) | Place in `pretrain` directory | Compatible with `whisper-ppg-large` | 314 | | cnhubertlarge | [chinese-hubert-large-fairseq-ckpt.pt](https://huggingface.co/TencentGameMate/chinese-hubert-large/resolve/main/chinese-hubert-large-fairseq-ckpt.pt) | Place in `pretrain` directory | - | 315 | | dphubert | 
[DPHuBERT-sp0.75.pth](https://huggingface.co/pyf98/DPHuBERT/resolve/main/DPHuBERT-sp0.75.pth) | Place in `pretrain` directory | - | 316 | | WavLM | [WavLM-Base+.pt](https://valle.blob.core.windows.net/share/wavlm/WavLM-Base+.pt?sv=2020-08-04&st=2023-03-01T07%3A51%3A05Z&se=2033-03-02T07%3A51%3A00Z&sr=c&sp=rl&sig=QJXmSJG9DbMKf48UDIU1MfzIro8HQOf3sqlNXiflY1I%3D) | Place in `pretrain` directory | Download link might be problematic, unable to download | 317 | | OnnxHubert/ContentVec | [MoeSS-SUBModel](https://huggingface.co/NaruseMioShirakana/MoeSS-SUBModel/tree/main) | Place in `pretrain` directory | - | 318 | 319 | #### Detailed Explanation of Each Encoder 320 | 321 | | Encoder Name | Advantages | Disadvantages | 322 | | ------------------------------ | ------------------------------------------------------------------ | --------------------------------- | 323 | | `vec768l12` (Most Recommended) | Best voice fidelity, large base model, supports loudness embedding | Weak articulation | 324 | | `vec256l9` | No particular advantages | Does not support diffusion models | 325 | | `hubertsoft` | Strong articulation | Voice leakage | 326 | | `whisper-ppg` | Strongest articulation | Voice leakage, high VRAM usage | 327 | 328 | ### 2.2.2 Pre-trained Base Model (Strongly Recommended) 329 | 330 | - Pre-trained base model files: `G_0.pth`, `D_0.pth`. Place in the `logs/44k` directory. 331 | 332 | - Diffusion model pre-trained base model file: `model_0.pt`. Place in the `logs/44k/diffusion` directory. 333 | 334 | The diffusion model references the Diffusion Model from [DDSP-SVC](https://github.com/yxlllc/DDSP-SVC), and the base model is compatible with the diffusion model from [DDSP-SVC](https://github.com/yxlllc/DDSP-SVC). Some of the provided base model files are from the integration package of “[羽毛布団](https://space.bilibili.com/3493141443250876)”, to whom we express our gratitude. 
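For reference, here is a minimal sketch of where the files described above end up, assuming you use the recommended contentvec encoder together with a vec768l12 base model and the optional diffusion base model (adjust the file names to whichever encoder and base models you actually download):

```
so-vits-svc
├───pretrain
│   └───checkpoint_best_legacy_500.pt
└───logs
    └───44k
        ├───G_0.pth
        ├───D_0.pth
        └───diffusion
            └───model_0.pt
```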
335 | 336 | **Provide 4.1 training base models, please download them yourself (requires external network conditions)** 337 | 338 | | Encoder Type | Main Model Base | Diffusion Model Base | Description | 339 | | ----------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | 340 | | vec768l12 | [G_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/vec768l12/G_0.pth), [D_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/vec768l12/D_0.pth) | [model_0.pt](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/diffusion/768l12/model_0.pt) | If only training for 100 steps diffusion, i.e., `k_step_max = 100`, use [model_0.pt](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/diffusion/768l12/max100/model_0.pt) for the diffusion model | 341 | | vec768l12 (with loudness embedding) | [G_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/vec768l12/vol_emb/G_0.pth), [D_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/vec768l12/vol_emb/D_0.pth) | [model_0.pt](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/diffusion/768l12/model_0.pt) | If only training for 100 steps diffusion, i.e., `k_step_max = 100`, use [model_0.pt](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/diffusion/768l12/max100/model_0.pt) for the diffusion model | 342 | | vec256l9 | [G_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/vec256l9/G_0.pth), [D_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/vec256l9/D_0.pth) | Not supported | - | 343 | | hubertsoft | [G_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/hubertsoft/G_0.pth), [D_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/hubertsoft/D_0.pth) | [model_0.pt](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/diffusion/hubertsoft/model_0.pt) | - | 344 | | whisper-ppg | [G_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/whisper-ppg/G_0.pth), [D_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/whisper-ppg/D_0.pth) | [model_0.pt](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/diffusion/whisper-ppg/model_0.pt) | - | 345 | | tiny (vec768l12_vol_emb) | [G_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/tiny/vec768l12_vol_emb/G_0.pth), [D_0.pth](https://huggingface.co/Sucial/so-vits-svc4.1-pretrain_model/blob/main/tiny/vec768l12_vol_emb/D_0.pth) | - | TINY is based on the original So-VITS model with reduced network parameters, using Depthwise Separable Convolution 
and FLOW shared parameter technology, significantly reducing model size and improving inference speed. TINY is designed for real-time conversion; reduced parameters mean its conversion effect is theoretically inferior to the original model. Real-time conversion GUI for So-VITS is under development. Until then, if there's no special need, training TINY model is not recommended. | 346 | 347 | > [!WARNING] 348 | > 349 | > Pre-trained models for other encoders not mentioned are not provided. Please train without base models, which may significantly increase training difficulty! 350 | 351 | **Base Model and Support** 352 | 353 | | Standard Base | Loudness Embedding | Loudness Embedding + TINY | Full Diffusion | 100-Step Shallow Diffusion | 354 | | ------------- | ------------------ | ------------------------- | -------------- | -------------------------- | 355 | | Vec768L12 | Supported | Supported | Supported | Supported | 356 | | Vec256L9 | Supported | Not Supported | Not Supported | Not Supported | 357 | | hubertsoft | Supported | Not Supported | Supported | Not Supported | 358 | | whisper-ppg | Supported | Not Supported | Supported | Not Supported | 359 | 360 | ### 2.2.3 Optional Items (Choose as Needed) 361 | 362 | **1. NSF-HIFIGAN** 363 | 364 | If using the `NSF-HIFIGAN enhancer` or `shallow diffusion`, you need to download the pre-trained NSF-HIFIGAN model provided by [OpenVPI]. If not needed, you can skip this. 365 | 366 | - Pre-trained NSF-HIFIGAN vocoder: 367 | - Version 2022.12: [nsf_hifigan_20221211.zip](https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip); 368 | - Version 2024.02: [nsf_hifigan_44.1k_hop512_128bin_2024.02.zip](https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-44.1k-hop512-128bin-2024.02/nsf_hifigan_44.1k_hop512_128bin_2024.02.zip) 369 | - After extracting, place the four files in the `pretrain/nsf_hifigan` directory. 370 | - If you download version 2024.02 of the vocoder, rename `model.ckpt` to `model`, i.e., remove the file extension. 371 | 372 | **2. RMVPE** 373 | 374 | If using the `rmvpe` F0 predictor, you need to download the pre-trained RMVPE model. 375 | 376 | - Download the model [rmvpe.zip](https://github.com/yxlllc/RMVPE/releases/download/230917/rmvpe.zip), which is currently recommended. 377 | - Extract `rmvpe.zip`, rename the `model.pt` file to `rmvpe.pt`, and place it in the `pretrain` directory. 378 | 379 | **3. FCPE (Preview Version)** 380 | 381 | [FCPE](https://github.com/CNChTu/MelPE) (Fast Context-based Pitch Estimator) is a new F0 predictor developed independently by svc-develop-team, designed specifically for real-time voice conversion. It will become the preferred F0 predictor for Sovits real-time voice conversion in the future. 382 | 383 | If using the `fcpe` F0 predictor, you need to download the pre-trained FCPE model. 384 | 385 | - Download the model [fcpe.pt](https://huggingface.co/datasets/ylzz1997/rmvpe_pretrain_model/resolve/main/fcpe.pt). 386 | - Place it in the `pretrain` directory. 387 | 388 | ## 2.3 Data Preparation 389 | 390 | 1. Organize the dataset into the `dataset_raw` directory according to the following file structure. 391 | 392 | ``` 393 | dataset_raw 394 | ├───speaker0 395 | │ ├───xxx1-xxx1.wav 396 | │ ├───... 397 | │ └───Lxx-0xx8.wav 398 | └───speaker1 399 | ├───xx2-0xxx2.wav 400 | ├───... 401 | └───xxx7-xxx007.wav 402 | ``` 403 | 404 | 2. You can customize the names of the speakers. 405 | 406 | ``` 407 | dataset_raw 408 | └───suijiSUI 409 | ├───1.wav 410 | ├───... 
411 | └───25788785-20221210-200143-856_01_(Vocals)_0_0.wav 412 | ``` 413 | 414 | 3. Additionally, you need to create and edit `config.json` in `dataset_raw` 415 | 416 | ```json 417 | { 418 | "n_speakers": 10, 419 | 420 | "spk": { 421 | "speaker0": 0, 422 | "speaker1": 1 423 | } 424 | } 425 | ``` 426 | 427 | - `"n_speakers": 10`: The number represents the number of speakers, starting from 1, and needs to correspond to the number below 428 | - `"speaker0": 0`: "speaker0" refers to the speaker's name, which can be changed. The numbers 0, 1, 2... represent the speaker count, starting from 0. 429 | 430 | ## 2.4 Data Preprocessing 431 | 432 | ### 2.4.0 Audio Slicing 433 | 434 | - Slice the audio into `5s - 15s` segments. Slightly longer segments are acceptable, but excessively long segments may lead to out-of-memory errors during training or preprocessing. 435 | 436 | - You can use [audio-slicer-GUI](https://github.com/flutydeer/audio-slicer) or [audio-slicer-CLI](https://github.com/openvpi/audio-slicer) for assistance in slicing. Generally, you only need to adjust the `Minimum Interval`. For regular speech material, the default value is usually sufficient, while for singing material, you may adjust it to `100` or even `50`. 437 | 438 | - After slicing, manually handle audio that is too long (over 15 seconds) or too short (under 4 seconds). Short audio can be concatenated into multiple segments, while long audio can be manually split. 439 | 440 | > [!WARNING] 441 | > 442 | > **If you are training with the Whisper-ppg sound encoder, all slices must be less than 30s in length.** 443 | 444 | ### 2.4.1 Resampling to 44100Hz Mono 445 | 446 | Use the following command (skip this step if loudness matching has already been performed): 447 | 448 | ```bash 449 | python resample.py 450 | ``` 451 | 452 | > [!NOTE] 453 | > 454 | > Although this project provides a script `resample.py` for resampling, converting to mono, and loudness matching, the default loudness matching matches to 0db, which may degrade audio quality. Additionally, the loudness normalization package `pyloudnorm` in Python cannot apply level limiting, which may lead to clipping. It is recommended to consider using professional audio processing software such as `Adobe Audition` for loudness matching. You can also use a loudness matching tool I developed, [Loudness Matching Tool](https://github.com/AI-Hobbyist/Loudness-Matching-Tool). If you have already performed loudness matching with other software, you can add `--skip_loudnorm` when running the above command to skip the loudness matching step. For example: 455 | 456 | ```bash 457 | python resample.py --skip_loudnorm 458 | ``` 459 | 460 | ### 2.4.2 Automatic Dataset Splitting and Configuration File Generation 461 | 462 | Use the following command (skip this step if loudness embedding is required): 463 | 464 | ```bash 465 | python preprocess_flist_config.py --speech_encoder vec768l12 466 | ``` 467 | 468 | The `speech_encoder` parameter has seven options, as explained in **[2.2.1 Required Items and Explanation of Each Encoder](#detailed-explanation-of-each-encoder)**. If you omit the `speech_encoder` parameter, the default value is `vec768l12`. 469 | 470 | ``` 471 | vec768l12 472 | vec256l9 473 | hubertsoft 474 | whisper-ppg 475 | whisper-ppg-large 476 | cnhubertlarge 477 | dphubert 478 | ``` 479 | 480 | #### Using Loudness Embedding 481 | 482 | - When using loudness embedding, the trained model will match the loudness of the input source. 
Otherwise, it will match the loudness of the training set. 483 | - If using loudness embedding, you need to add the `--vol_aug` parameter, for example: 484 | 485 | ```bash 486 | python preprocess_flist_config.py --speech_encoder vec768l12 --vol_aug 487 | ``` 488 | 489 | ### 2.4.3 Modify Configuration Files as Needed 490 | 491 | #### config.json 492 | 493 | - `vocoder_name`: Select a vocoder, default is `nsf-hifigan`. 494 | - `log_interval`: How often to output logs, default is `200`. 495 | - `eval_interval`: How often to perform validation and save the model, default is `800`. 496 | - `epochs`: Total number of training epochs, default is `10000`. Training will automatically stop after reaching this number of epochs. 497 | - `learning_rate`: Learning rate, it's recommended to keep the default value. 498 | - `batch_size`: The amount of data loaded onto the GPU for each training step, adjust to a size lower than the GPU memory capacity. 499 | - `all_in_mem`: Load all dataset into memory. Enable this if disk IO is too slow on some platforms and the memory capacity is much larger than the dataset size. 500 | - `keep_ckpts`: Number of recent models to keep during training, `0` to keep all. Default is to keep only the last `3` models. 501 | 502 | **Vocoder Options** 503 | 504 | ``` 505 | nsf-hifigan 506 | nsf-snake-hifigan 507 | ``` 508 | 509 | #### diffusion.yaml 510 | 511 | - `cache_all_data`: Load all dataset into memory. Enable this if disk IO is too slow on some platforms and the memory capacity is much larger than the dataset size. 512 | - `duration`: Duration of audio slices during training. Adjust according to GPU memory size. **Note: This value must be less than the shortest duration of audio in the training set!** 513 | - `batch_size`: The amount of data loaded onto the GPU for each training step, adjust to a size lower than the GPU memory capacity. 514 | - `timesteps`: Total steps of the diffusion model, default is 1000. A complete Gaussian diffusion has a total of 1000 steps. 515 | - `k_step_max`: During training, only `k_step_max` steps of diffusion can be trained to save training time. Note that this value must be less than `timesteps`. `0` means training the entire diffusion model. **Note: If not training the entire diffusion model, the model can't be used for inference with only the diffusion model!** 516 | 517 | ### 2.4.3 Generating Hubert and F0 518 | 519 | Use the following command (skip this step if training shallow diffusion): 520 | 521 | ```bash 522 | # The following command uses rmvpe as the f0 predictor, you can manually modify it 523 | python preprocess_hubert_f0.py --f0_predictor rmvpe 524 | ``` 525 | 526 | The `f0_predictor` parameter has six options, and some F0 predictors require downloading additional preprocessing models. Please refer to **[2.2.3 Optional Items (Choose According to Your Needs)](#223-optional-items-choose-as-needed)** for details. 527 | 528 | ``` 529 | crepe 530 | dio 531 | pm 532 | harvest 533 | rmvpe (recommended!) 
534 | fcpe 535 | ``` 536 | 537 | #### Pros and Cons of Each F0 Predictor 538 | 539 | | Predictor | Pros | Cons | 540 | | --------- | --------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------ | 541 | | pm | Fast, low resource consumption | Prone to producing breathy voice | 542 | | crepe | Rarely produces breathy voice | High memory consumption, may produce out-of-tune voice | 543 | | dio | - | May produce out-of-tune voice | 544 | | harvest | Better performance in the lower pitch range | Inferior performance in other pitch ranges | 545 | | rmvpe | Almost flawless, currently the most accurate predictor | Virtually no drawbacks (extremely low-pitched sounds may be problematic) | 546 | | fcpe | Developed by the SVC team, currently the fastest predictor, with accuracy comparable to crepe | - | 547 | 548 | > [!NOTE] 549 | > 550 | > 1. If the training set is too noisy, use crepe to process f0. 551 | > 2. If you omit the `f0_predictor` parameter, the default value is rmvpe. 552 | 553 | **If shallow diffusion functionality is required (optional), add the --use_diff parameter, for example:** 554 | 555 | ```bash 556 | # The following command uses rmvpe as the f0 predictor, you can manually modify it 557 | python preprocess_hubert_f0.py --f0_predictor rmvpe --use_diff 558 | ``` 559 | 560 | **If the processing speed is slow, or if your dataset is large, you can add the `--num_processes` parameter:** 561 | 562 | ```bash 563 | # The following command uses rmvpe as the f0 predictor, you can manually change it 564 | python preprocess_hubert_f0.py --f0_predictor rmvpe --num_processes 8 565 | # All workers will be automatically assigned to multiple threads 566 | ``` 567 | 568 | After completing the above steps, the `dataset` directory generated is the preprocessed data, and you can delete the `dataset_raw` folder as needed. 569 | 570 | ## 2.5 Training 571 | 572 | ### 2.5.1 Main Model Training (Required) 573 | 574 | Use the following command to train the main model. You can also use the same command to resume training if it pauses. 575 | 576 | ```bash 577 | python train.py -c configs/config.json -m 44k 578 | ``` 579 | 580 | ### 2.5.2 Diffusion Model (Optional) 581 | 582 | - A major update in So-VITS-SVC 4.1 is the introduction of Shallow Diffusion mechanism, which converts the original output audio of SoVITS into Mel spectrograms, adds noise, and performs shallow diffusion processing before outputting the audio through the vocoder. Testing has shown that **the quality of the output is significantly enhanced after the original output audio undergoes shallow diffusion processing, addressing issues such as electronic noise and background noise**. 583 | - If shallow diffusion functionality is required, you need to train the diffusion model. Before training, ensure that you have downloaded and correctly placed `NSF-HIFIGAN` (**refer to [2.2.3 Optional Items](#223-optional-items-choose-as-needed)**), and added the `--use_diff` parameter when preprocessing to generate Hubert and F0 (**refer to [2.4.3 Generating Hubert and F0](#243-generating-hubert-and-f0)**). 584 | 585 | To train the diffusion model, use the following command: 586 | 587 | ```bash 588 | python train_diff.py -c configs/diffusion.yaml 589 | ``` 590 | 591 | After the model training is complete, the model files are saved in the `logs/44k` directory, and the diffusion model is saved in `logs/44k/diffusion`. 
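A quick way to confirm that training is actually producing checkpoints is to look inside these log directories; the file names below are only illustrative (checkpoints are written every `eval_interval` steps, 800 by default), and resuming an interrupted run simply reuses the same training commands:

```bash
# Check which checkpoints have been written so far
ls logs/44k            # e.g. G_800.pth, D_800.pth, ... (main model)
ls logs/44k/diffusion  # e.g. model_800.pt, ...          (diffusion model, if trained)

# Resuming after an interruption uses exactly the same commands as starting training
python train.py -c configs/config.json -m 44k
python train_diff.py -c configs/diffusion.yaml
```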
> [!IMPORTANT]
>
> **How do you know when the model is trained well?**
>
> 1. This is a very boring and largely meaningless question. It's like asking a teacher how to make your child study well: except for yourself, no one can answer it.
> 2. How well the model trains depends on the quality and duration of your dataset, the chosen encoder and f0 algorithm, and even some supernatural, mystical factors. Even with a finished model, the final conversion result depends on your input source and inference parameters. This is not a linear process, and there are far too many variables involved. So if you must ask questions like "Why doesn't my model sound like the target?" or "How do you know when the model is trained well?", I can only say WHO F\*\*KING KNOWS?
> 3. But that doesn't mean there is nothing you can do: you can always pray and worship. I don't deny that praying and worshiping is an effective method, but you can also use some more scientific tools, such as Tensorboard. [2.5.3 Tensorboard](#253-tensorboard) below explains how to use Tensorboard to help you understand the training status. **Of course, the most powerful tool is actually yourself. How do you know when an acoustic model is trained well? Put on your headphones and let your ears tell you.**

**Relationship between Epoch and Step**:

During training, a model is saved every specified number of steps (default is 800 steps, corresponding to the `eval_interval` value) according to the setting in your `config.json`.
It's important to distinguish between epochs and steps: 1 Epoch means that every sample in the training set has taken part in learning once, while 1 Step means that one learning step has been taken. Because of `batch_size`, each step can contain several samples. Therefore, the conversion between Epoch and Step is:

$$
Epoch = \frac{Step}{\text{Number of samples in the dataset} \div batch\_size}
$$

Training ends after 10,000 epochs by default (you can raise or lower this limit by modifying the `epochs` field in `config.json`), but good results can typically be achieved after a few hundred epochs. When you feel that training is almost complete, you can interrupt it by pressing `Ctrl + C` in the training terminal. After interruption, as long as you haven't re-run preprocessing on the training set, you can **resume training from the most recent saved point**.

### 2.5.3 Tensorboard

You can use Tensorboard to visualize the trends of the loss values during training, listen to audio samples, and thereby help judge the training status of the model. **However, for the So-VITS-SVC project, the loss values have little practical reference value (you don't need to compare or study the values themselves); the real reference is still listening to the inferred audio with your ears!**

- Use the following command to open Tensorboard:

```bash
tensorboard --logdir=./logs/44k
```

By default, Tensorboard logs are generated from the evaluation performed every 200 training steps, so no images will appear until training reaches 200 steps. The value of 200 can be changed via `log_interval` in `config.json`.

- Explanation of Losses

You don't need to understand the specific meaning of each loss.
In general: 627 | 628 | - `loss/g/f0`, `loss/g/mel`, and `loss/g/total` should oscillate and eventually converge to some value. 629 | - `loss/g/kl` should oscillate at a low level. 630 | - `loss/g/fm` should continue to rise in the middle of training, and in the later stages, the upward trend should slow down or even start to decline. 631 | 632 | > [!IMPORTANT] 633 | > 634 | > ✨ Observing the trends of loss curves can help you judge the training status of the model. However, losses alone cannot be the sole criterion for judging the training status of the model, **and in fact, their reference value is not very significant. You still need to judge whether the model is trained well by listening to the audio output with your ears**. 635 | 636 | > [!WARNING] 637 | > 638 | > 1. For small datasets (30 minutes or even smaller), it is not recommended to train for too long when loading the base model. This is to make the best use of the advantages of the base model. The best results can be achieved in thousands or even hundreds of steps. 639 | > 2. The audio samples in Tensorboard are generated based on your validation set and **cannot represent the final performance of the model**. 640 | 641 | # 3. Inference 642 | 643 | ✨ Before inference, please prepare the dry audio you need for inference, ensuring it has no background noise/reverb and is of good quality. You can use [UVR5](https://github.com/Anjok07/ultimatevocalremovergui/releases/tag/v5.6) for processing to obtain the dry audio. Additionally, I've also created a [UVR5 vocal separation tutorial](https://www.bilibili.com/video/BV1F4421c7qU/). 644 | 645 | ## 3.1 Command-line Inference 646 | 647 | Perform inference using inference_main.py 648 | 649 | ```bash 650 | # Example 651 | python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "your_inference_audio.wav" -t 0 -s "speaker" 652 | ``` 653 | 654 | **Required Parameters:** 655 | 656 | - `-m` | `--model_path`: Path to the model 657 | - `-c` | `--config_path`: Path to the configuration file 658 | - `-n` | `--clean_names`: List of wav file names, placed in the raw folder 659 | - `-t` | `--trans`: Pitch adjustment, supports positive and negative (in semitones) 660 | - `-s` | `--spk_list`: Name of the target speaker for synthesis 661 | - `-cl` | `--clip`: Audio forced clipping, default 0 for automatic clipping, unit in seconds/s. 662 | 663 | > [!NOTE] 664 | > 665 | > **Audio Clipping** 666 | > 667 | > - During inference, the clipping tool will split the uploaded audio into several small segments based on silence sections, and then combine them after inference to form the complete audio. This approach benefits from lower GPU memory usage for small audio segments, thus enabling the segmentation of long audio for inference to avoid GPU memory overflow. The clipping threshold parameter controls the minimum full-scale decibel value, and anything lower will be considered as silence and removed. Therefore, when the uploaded audio is noisy, you can set this parameter higher (e.g., -30), whereas for cleaner audio, a lower value (e.g., -50) can be set to avoid cutting off breath sounds and faint voices. 668 | > 669 | > - A recent test by the development team suggests that smaller clipping thresholds (e.g., -50) improve the clarity of the output, although the principle behind this is currently unclear. 
670 | > 671 | > **Forced Clipping** `-cl` | `--clip` 672 | > 673 | > - During inference, the clipping tool may sometimes produce overly long audio segments when continuous vocal sections exist without silence for an extended period, potentially causing GPU memory overflow. The automatic audio clipping feature sets a maximum duration for audio segmentation. After the initial segmentation, if there are audio segments longer than this duration, they will be forcibly re-segmented at this duration to avoid memory overflow issues. 674 | > - Forced clipping may result in cutting off audio in the middle of a word, leading to discontinuity in the synthesized voice. You need to set the crossfade length for forced clipping in advanced settings to mitigate this issue. 675 | 676 | **Optional Parameters: See Next Section for Specifics** 677 | 678 | - `-lg` | `--linear_gradient`: Crossfade length of two audio clips, adjust this value if there are discontinuities in the voice after forced clipping, recommended to use default value 0, unit in seconds 679 | - `-f0p` | `--f0_predictor`: Choose F0 predictor, options are crepe, pm, dio, harvest, rmvpe, fcpe, default is pm (Note: crepe uses mean filter for original F0), refer to the advantages and disadvantages of different F0 predictors in [2.4.3 F0 Predictor Advantages and Disadvantages](#243-generating-hubert-and-f0) 680 | - `-a` | `--auto_predict_f0`: Automatically predict pitch during voice conversion, do not enable this when converting singing voices as it may severely mis-tune 681 | - `-cm` | `--cluster_model_path`: Path to clustering model or feature retrieval index, leave empty to automatically set to the default path of each solution model, fill in randomly if no clustering or feature retrieval is trained 682 | - `-cr` | `--cluster_infer_ratio`: Ratio of clustering solution or feature retrieval, range 0-1, defaults to 0 if no clustering model or feature retrieval is trained 683 | - `-eh` | `--enhance`: Whether to use the NSF_HIFIGAN enhancer, this option has a certain sound quality enhancement effect on models with a limited training set, but has a negative effect on well-trained models, default is off 684 | - `-shd` | `--shallow_diffusion`: Whether to use shallow diffusion, enabling this can solve some electronic sound problems, default is off, when this option is enabled, the NSF_HIFIGAN enhancer will be disabled 685 | - `-usm` | `--use_spk_mix`: Whether to use speaker blending/dynamic voice blending 686 | - `-lea` | `--loudness_envelope_adjustment`: Ratio of input source loudness envelope replacement to output loudness envelope fusion, the closer to 1, the more the output loudness envelope is used 687 | - `-fr` | `--feature_retrieval`: Whether to use feature retrieval, if a clustering model is used, it will be disabled, and the cm and cr parameters will become the index path and mixing ratio of feature retrieval 688 | 689 | > [!NOTE] 690 | > 691 | > **Clustering Model/Feature Retrieval Mixing Ratio** `-cr` | `--cluster_infer_ratio` 692 | > 693 | > - This parameter controls the proportion of linear involvement when using clustering models/feature retrieval models. Clustering models and feature retrieval models can both slightly improve timbre similarity, but at the cost of reducing accuracy in pronunciation (feature retrieval has slightly better pronunciation than clustering). The range of this parameter is 0-1, where 0 means it is not enabled, and the closer to 1, the more similar the timbre and the blurrier the pronunciation. 
694 | > - Clustering models and feature retrieval share this parameter. When loading models, the model used will be controlled by this parameter. 695 | > - **Note that when clustering models or feature retrieval models are not loaded, please keep this parameter as 0, otherwise an error will occur.** 696 | 697 | **Shallow Diffusion Settings:** 698 | 699 | - `-dm` | `--diffusion_model_path`: Diffusion model path 700 | - `-dc` | `--diffusion_config_path`: Diffusion model configuration file path 701 | - `-ks` | `--k_step`: Number of diffusion steps, larger values are closer to the result of the diffusion model, default is 100 702 | - `-od` | `--only_diffusion`: Pure diffusion mode, this mode does not load the sovits model and performs inference based only on the diffusion model 703 | - `-se` | `--second_encoding`: Secondary encoding, the original audio will be encoded a second time before shallow diffusion, a mysterious option, sometimes it works well, sometimes it doesn't 704 | 705 | > [!NOTE] 706 | > 707 | > **About Shallow Diffusion Steps** `-ks` | `--k_step` 708 | > 709 | > The complete Gaussian diffusion takes 1000 steps. When the number of shallow diffusion steps reaches 1000, the output result at this point is entirely the output result of the diffusion model, and the So-VITS model will be suppressed. The higher the number of shallow diffusion steps, the closer it is to the output result of the diffusion model. **If you only want to use shallow diffusion to remove electronic noise while preserving the timbre of the So-VITS model as much as possible, the number of shallow diffusion steps can be set to 30-50** 710 | 711 | > [!WARNING] 712 | > 713 | > If using the `whisper-ppg` voice encoder for inference, `--clip` should be set to 25, `-lg` should be set to 1. Otherwise, inference will not work properly. 714 | 715 | ## 3.2 webUI Inference 716 | 717 | Use the following command to open the webUI interface, **upload and load the model, fill in the inference as needed according to the instructions, upload the inference audio, and start the inference.** 718 | 719 | The detailed explanation of the inference parameters is the same as the [3.1 Command-line Inference](#31-command-line-inference) parameters, but moved to the interactive interface with simple instructions. 720 | 721 | ```bash 722 | python webUI.py 723 | ``` 724 | 725 | > [!WARNING] 726 | > 727 | > **Be sure to check [Command-line Inference](#31-command-line-inference) to understand the meanings of specific parameters. Pay special attention to the reminders in NOTE and WARNING!** 728 | 729 | The webUI also has a built-in **text-to-speech (TTS)** function: 730 | 731 | - Text-to-speech uses Microsoft's edge_TTS service to generate a piece of original speech, and then converts the voice of this speech to the target voice using So-VITS. So-VITS can only achieve voice conversion (SVC) for singing voices, and does not have any **native** text-to-speech (TTS) function! Since the speech generated by Microsoft's edge_TTS is relatively stiff and lacks emotion, all converted audio will also reflect this. **If you need a TTS function with emotions, please visit the [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS) project.** 732 | - Currently, text-to-speech supports a total of 55 languages, covering most common languages. The program will automatically recognize the language based on the text entered in the text box and convert it. 
733 | - Automatic recognition can only detect the language; some languages have multiple accents and speakers. When automatic recognition is used, the program randomly selects a speaker that matches the detected language and the specified gender. If your target language has multiple accents or speakers (e.g., English), it is recommended to manually pick a speaker with a specific accent. When a speaker is manually specified, the previously selected gender is overridden.
734 |
735 | # 4. Optional Enhancements
736 |
737 | ✨ If you are satisfied with the results so far, or did not quite follow what is discussed below, you can skip the rest of this chapter without affecting model usage. These optional enhancements have relatively minor effects; they may help on specific data, but in most cases the difference is not very noticeable.
738 |
739 | ## 4.1 Automatic F0 Prediction
740 |
741 | An F0 predictor is trained alongside the model. It provides automatic pitch shifting during inference, matching the output pitch to the source pitch, which is useful when converting spoken voice. **However, do not enable this feature when converting singing voices!!! It will severely mis-tune!!!**
742 |
743 | - Command-line Inference: enable `auto_predict_f0` (the `-a` parameter described in [3.1 Command-line Inference](#31-command-line-inference)).
744 | - WebUI Inference: Check the corresponding option.
745 |
746 | ## 4.2 Clustering Timbre Leakage Control
747 |
748 | Clustering schemes can reduce timbre leakage, making the model output more similar to the target timbre (though the difference is not very obvious). However, using clustering alone can reduce the model's pronunciation clarity (making it sound slurred). This project therefore adopts a fusion approach that linearly controls the proportion of the clustering scheme versus the non-clustering scheme. In other words, you can manually adjust the balance between "similar to the target timbre" and "clear pronunciation" to find a suitable trade-off.
749 |
750 | Using clustering does not require any changes to the steps described earlier; you only need to train an additional clustering model. The effect is relatively limited, but so is the training cost.
751 |
752 | - Training Method:
753 |
754 | ```bash
755 | # Train using CPU:
756 | python cluster/train_cluster.py
757 | # Or train using GPU:
758 | python cluster/train_cluster.py --gpu
759 | ```
760 |
761 | After training, the model output will be saved in `logs/44k/kmeans_10000.pt`
762 |
763 | - During Command-line Inference:
764 |   - Specify `cluster_model_path` in `inference_main.py`
765 |   - Specify `cluster_infer_ratio` in `inference_main.py`, where `0` means clustering is not used at all and `1` means only clustering is used. Usually `0.5` is sufficient.
766 | - During WebUI Inference:
767 |   - Upload and load the clustering model.
768 |   - Set the clustering model/feature retrieval mixing ratio, between 0 and 1, where 0 means not using clustering/feature retrieval at all. Using clustering/feature retrieval can improve timbre similarity but may reduce pronunciation clarity (if used, a value around 0.5 is recommended).
769 |
770 | ## 4.3 Feature Retrieval
771 |
772 | Similar to clustering, feature retrieval can also reduce timbre leakage. Its pronunciation clarity is slightly better than clustering, but it may reduce inference speed. It uses the same fusion approach, allowing linear control of the proportion of feature retrieval versus non-feature retrieval.
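For orientation, here is a rough sketch of what a command-line inference call might look like once a clustering model or feature retrieval index has been trained (see the training steps in 4.2 above and in this section below). The file names, the speaker name `speaker0`, and the input audio are placeholders, and the required parameters (`-m` model path, `-c` config path, `-n` input audio name, `-t` pitch shift, `-s` speaker) are assumed to follow the required-parameter list earlier in [3.1 Command-line Inference](#31-command-line-inference); adjust everything to your own files.

```bash
# Hypothetical examples only; all paths and names are placeholders.
# "input.wav" is the source audio (usually placed in the raw/ folder).

# Clustering: point -cm at the kmeans model and blend at 0.5.
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" \
    -n "input.wav" -t 0 -s "speaker0" \
    -cm "logs/44k/kmeans_10000.pt" -cr 0.5

# Feature retrieval: add -fr and point -cm at the index file instead.
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" \
    -n "input.wav" -t 0 -s "speaker0" \
    -fr -cm "logs/44k/feature_and_index.pkl" -cr 0.5
```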
773 |
774 | - Training Process: After generating hubert and f0, execute:
775 |
776 | ```bash
777 | python train_index.py -c configs/config.json
778 | ```
779 |
780 | After training, the model output will be saved in `logs/44k/feature_and_index.pkl`
781 |
782 | - During Command-line Inference:
783 |   - Specify `--feature_retrieval` first; the clustering scheme will then automatically switch to the feature retrieval scheme
784 |   - Specify `cluster_model_path` in `inference_main.py` as the path to this output file
785 |   - Specify `cluster_infer_ratio` in `inference_main.py`, where `0` means feature retrieval is not used at all and `1` means only feature retrieval is used. Usually `0.5` is sufficient.
786 | - During WebUI Inference:
787 |   - Upload and load the clustering model
788 |   - Set the clustering model/feature retrieval mixing ratio, between 0 and 1, where 0 means not using clustering/feature retrieval at all. Using clustering/feature retrieval can improve timbre similarity but may reduce pronunciation clarity (if used, a value around 0.5 is recommended)
789 |
790 | ## 4.4 Vocoder Fine-tuning
791 |
792 | When the diffusion model is used in So-VITS, the Mel spectrogram enhanced by the diffusion model is rendered into the final audio by the vocoder, so the vocoder plays a decisive role in the sound quality of the output audio. So-VITS-SVC currently uses the [NSF-HiFiGAN community vocoder](https://openvpi.github.io/vocoders/). You can also fine-tune this vocoder model with your own dataset so that it better suits your model in the **diffusion process** of So-VITS.
793 |
794 | The [SingingVocoders](https://github.com/openvpi/SingingVocoders) project provides methods for fine-tuning the vocoder. In the Diffusion-SVC project, **using a fine-tuned vocoder can greatly enhance the output sound quality**. You can also train a fine-tuned vocoder with your own dataset and use it here.
795 |
796 | 1. Train a fine-tuned vocoder using [SingingVocoders](https://github.com/openvpi/SingingVocoders) and obtain its model and configuration files.
797 | 2. Place the model and configuration files under `pretrain/{fine-tuned vocoder name}/`.
798 | 3. Modify the diffusion model configuration file `diffusion.yaml` of the corresponding model as follows:
799 |
800 | ```yaml
801 | vocoder:
802 |   ckpt: pretrain/nsf_hifigan/model.ckpt # This line is the path to your fine-tuned vocoder model
803 |   type: nsf-hifigan # This line is the type of your fine-tuned vocoder, do not modify if unsure
804 | ```
805 |
806 | 4. Following [3.2 webUI Inference](#32-webui-inference), upload the diffusion model together with the **modified diffusion model configuration file** to use the fine-tuned vocoder.
807 |
808 | > [!WARNING]
809 | >
810 | > **Currently, only the NSF-HiFiGAN vocoder supports fine-tuning.**
811 |
812 | ## 4.5 Directories for Saved Models
813 |
814 | Up to the previous section, a total of 4 trainable model types have been covered. The following table summarizes these model types and their configuration files.
815 |
816 | In the webUI, besides uploading and loading models, you can also load local model files: put the model files into a folder, place that folder inside the `trained` folder, click "Refresh Local Model List", and the webUI will detect it; then manually select the model you want to load. **Note**: Automatic loading of local models may not work properly for the (optional) models in the table below.
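As a concrete illustration of the paragraph above, the sketch below shows one way such a folder could be assembled. The folder name `MyVoice` and the checkpoint numbers are placeholders; the file names follow the defaults listed in the table below, and (per the note above) the optional models may not be auto-loaded reliably.

```bash
# Hypothetical example: collect a trained model set into trained/MyVoice/
# so that the webUI's "Refresh Local Model List" can find it.
mkdir -p trained/MyVoice
cp logs/44k/G_30400.pth trained/MyVoice/                  # So-VITS model
cp configs/config.json trained/MyVoice/                   # So-VITS configuration file
# Optional files, only if you trained them (automatic loading may not work for these):
cp logs/44k/diffusion/model_100000.pt trained/MyVoice/    # diffusion model
cp configs/diffusion.yaml trained/MyVoice/                # diffusion configuration file
cp logs/44k/kmeans_10000.pt trained/MyVoice/              # clustering model
cp logs/44k/feature_and_index.pkl trained/MyVoice/        # feature retrieval index
```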
817 |
818 | | File | Filename and Extension | Location |
819 | | --------------------------------------------- | ----------------------- | -------------------- |
820 | | So-VITS Model | `G_xxxx.pth` | `logs/44k` |
821 | | So-VITS Model Configuration File | `config.json` | `configs` |
822 | | Diffusion Model (Optional) | `model_xxxx.pt` | `logs/44k/diffusion` |
823 | | Diffusion Model Configuration File (Optional) | `diffusion.yaml` | `configs` |
824 | | Kmeans Clustering Model (Optional) | `kmeans_10000.pt` | `logs/44k` |
825 | | Feature Retrieval Model (Optional) | `feature_and_index.pkl` | `logs/44k` |
826 |
827 | # 5. Other Optional Features
828 |
829 | ✨ This part is less important than the previous sections. Except for [5.1 Model Compression](#51-model-compression), which is a genuinely convenient feature, the other optional features are rarely needed. Therefore, only references to the official documentation and brief descriptions are provided here.
830 |
831 | ## 5.1 Model Compression
832 |
833 | The generated models contain information needed for further training. If you are **sure not to continue training**, you can remove this information from the model to obtain a final model that is about 1/3 of the original size.
834 |
835 | Use `compress_model.py`:
836 |
837 | ```bash
838 | # For example, to compress the model logs/44k/G_30400.pth with the configuration file configs/config.json, run:
839 | python compress_model.py -c="configs/config.json" -i="logs/44k/G_30400.pth" -o="logs/44k/release.pth"
840 | # The compressed model is saved in logs/44k/release.pth
841 | ```
842 |
843 | > [!WARNING]
844 | >
845 | > **Note: Compressed models cannot be further trained!**
846 |
847 | ## 5.2 Voice Mixing
848 |
849 | ### 5.2.1 Static Voice Mixing
850 |
851 | **Refer to the static voice mixing feature in `webUI.py`, under Tools/Experimental Features.**
852 |
853 | This feature can combine multiple voice models into one voice model (a convex combination or linear combination of the models' parameters), thus creating voice characteristics that do not exist in reality.
854 |
855 | **Note:**
856 |
857 | 1. This feature only supports single-speaker models.
858 | 2. If you insist on using multi-speaker models, make sure the number of speakers is the same across all models, so that voices under the same SpeakerID can be mixed.
859 | 3. Ensure that the `model` field in the `config.json` of all models to be mixed is the same.
860 | 4. The output mixed model can use the `config.json` of any of the models being mixed, but the clustering model will not be usable afterwards.
861 | 5. When uploading models in batches, it is best to put them into one folder and upload them together.
862 | 6. It is recommended to keep the mixing ratios between 0 and 100. Other values also work, but may produce unexpected results in linear combination mode.
863 | 7. After mixing, the result is saved in the project root directory as `output.pth`.
864 | 8. Convex combination mode applies Softmax to the mixing ratios so that they sum to 1, while linear combination mode does not (see the sketch after this list).
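Put loosely, and assuming the mixing works as described in items 6-8 above (this is a reading of the description, not the exact implementation in `webUI.py`): if $\theta_1,\dots,\theta_K$ are the parameters of the models being mixed and $r_1,\dots,r_K$ are the ratios you set, the mixed model is roughly

$$
\theta_{\mathrm{mix}}=\sum_{i=1}^{K} w_i\,\theta_i,
\qquad
w_i=
\begin{cases}
\frac{e^{r_i}}{\sum_{j=1}^{K} e^{r_j}}, & \text{convex combination mode (Softmax, so } \sum_i w_i = 1\text{)}\\
r_i, & \text{linear combination mode (ratios used as-is)}
\end{cases}
$$

The Softmax in convex mode is why arbitrary ratio values are safe there, while linear mode leaves the overall scale entirely up to you.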
865 |
866 | ### 5.2.2 Dynamic Voice Mixing
867 |
868 | **Refer to the introduction of dynamic voice mixing in the `spkmix.py` file.**
869 |
870 | Rules for mixing role tracks:
871 |
872 | - Speaker ID: \[\[Start Time 1, End Time 1, Start Value 1, End Value 1], [Start Time 2, End Time 2, Start Value 2, End Value 2]]
873 | - The start time of each segment must equal the end time of the previous one; the first start time must be 0 and the last end time must be 1 (the time range is 0-1).
874 | - All roles must be filled in; roles that are not used can be set to \[\[0., 1., 0., 0.]].
875 | - The fusion value can be set freely: within the specified time range it changes linearly from the start value to the end value. The program internally normalizes the values so that they form a convex combination (sum to 1), so you can use it with confidence.
876 |
877 | Use the `--use_spk_mix` parameter during command-line inference to enable dynamic voice mixing. Check the "Dynamic Voice Mixing" option box during webUI inference.
878 |
879 | ## 5.3 Onnx Export
880 |
881 | Use `onnx_export.py`. Currently, only [MoeVoiceStudio](https://github.com/NaruseMioShirakana/MoeVoiceStudio) requires onnx models. For more detailed operations and usage methods, please refer to the [MoeVoiceStudio](https://github.com/NaruseMioShirakana/MoeVoiceStudio) repository instructions.
882 |
883 | - Create a new folder named `checkpoints` and open it
884 | - Inside `checkpoints`, create a project folder named after your project, for example `aziplayer`
885 | - Rename your model to `model.pth` and your configuration file to `config.json`, and place them in the `aziplayer` folder you just created
886 | - Change `"NyaruTaffy"` in `path = "NyaruTaffy"` in `onnx_export.py` to your project name, i.e. `path = "aziplayer"` (use `onnx_export_speaker_mix` for onnx export with role mixing support)
887 | - Run `python onnx_export.py`
888 | - Wait for it to finish. A `model.onnx` will be generated in your project folder; this is the exported model.
889 |
890 | Note: For the Hubert onnx model, use the one provided by [MoeVoiceStudio](https://github.com/NaruseMioShirakana/MoeVoiceStudio). It currently cannot be exported on its own (Hubert in fairseq uses many operators not supported by onnx and involves constants, which causes errors or incorrect input/output shapes and results during export).
891 |
892 | # 6. Simple Mixing and Exporting Finished Product
893 |
894 | ### Use Audio Host Software to Process Inferred Audio
895 |
896 | Please refer to the [corresponding video tutorial](https://www.bilibili.com/video/BV1Hr4y197Cy/) or other professional mixing tutorials for details on how to process and polish the inferred audio using audio host software (a DAW).
897 |
898 | # Appendix: Common Errors and Solutions
899 |
900 | ✨ **Some error solutions are credited to [羽毛布団](https://space.bilibili.com/3493141443250876)'s [related column](https://www.bilibili.com/read/cv22206231) | [related documentation](https://www.yuque.com/umoubuton/ueupp5/ieinf8qmpzswpsvr)**
901 |
902 | ## About Out of Memory (OOM)
903 |
904 | If you encounter an error like this in the terminal or WebUI:
905 |
906 | ```bash
907 | OutOfMemoryError: CUDA out of memory. Tried to allocate XX GiB (GPU 0: XX GiB total capacity; XX GiB already allocated; XX MiB Free; XX GiB reserved in total by PyTorch)
908 | ```
909 |
910 | Don't doubt it: your GPU memory or virtual memory is insufficient.
The following steps provide a 100% solution to the problem; work through them in order rather than asking this question elsewhere, as the solution is well documented.
911 |
912 | 1. In the error message, check whether `XX GiB already allocated` is followed by `0 bytes free`. If it shows `0 bytes free`, follow steps 2, 3, and 4. If it shows `XX MiB free` or `XX GiB free`, follow step 5.
913 | 2. If the out of memory occurs during preprocessing:
914 |    - Use a GPU-friendly F0 predictor (from highest to lowest friendliness: pm >= harvest >= rmvpe ≈ fcpe >> crepe). It is recommended to use rmvpe or fcpe first.
915 |    - Set multi-process preprocessing to 1.
916 | 3. If the out of memory occurs during training:
917 |    - Check if there are any excessively long clips in the dataset (more than 20 seconds).
918 |    - Reduce the batch size.
919 |    - Use a project with lower resource requirements.
920 |    - Rent a GPU with larger memory from platforms like Google Colab for training.
921 | 4. If the out of memory occurs during inference:
922 |    - Ensure the source audio (dry vocal) is clean (no residual reverb, accompaniment, or harmony), as dirty sources can hinder automatic slicing. Refer to the [UVR5 vocal separation tutorial](https://www.bilibili.com/video/BV1F4421c7qU/) for best practices.
923 |    - Increase the slicing threshold (e.g., change from -40 to -30; avoid going too high as it can cut the audio abruptly).
924 |    - Set forced slicing, starting from 60 seconds and decreasing by 10 seconds each time until inference succeeds.
925 |    - Use CPU for inference, which will be slower but won't encounter out of memory issues.
926 | 5. If there is still available memory but the out of memory error persists, increase your virtual memory to at least 50G.
927 |
928 | These steps should help you manage and resolve out of memory errors effectively, ensuring smooth operation during preprocessing, training, and inference.
929 |
930 | ## Common Errors and Solutions When Installing Dependencies
931 |
932 | **1. Error When Installing PyTorch with CUDA=11.7**
933 |
934 | ```
935 | ERROR: Package 'networkx' requires a different Python: 3.8.9 not in '>=3.9
936 | ```
937 |
938 | **Solutions:**
939 |
940 | - **Upgrade Python to 3.10.**
941 | - **(Recommended) Keep your current Python version:** install `networkx` version 3.0 first, then install PyTorch.
942 |
943 | ```bash
944 | pip install networkx==3.0
945 | # Then proceed with the installation of PyTorch.
946 | ```
947 |
948 | **2. Dependency Not Found**
949 |
950 | If you encounter errors similar to:
951 |
952 | ```bash
953 | ERROR: Could not find a version that satisfies the requirement librosa==0.9.1 (from versions: none)
954 | ERROR: No matching distribution found for librosa==0.9.1
955 | # Key characteristics of the error:
956 | No matching distribution found for xxxxx
957 | Could not find a version that satisfies the requirement xxxx
958 | ```
959 |
960 | **Solution:** Change the installation source. Add a download source when manually installing the dependency.
961 |
962 | Use the command `pip install [package_name] -i [source_url]`. For example, to download `librosa` version 0.9.1 from the Alibaba source, use the following command:
963 |
964 | ```bash
965 | pip install librosa==0.9.1 -i http://mirrors.aliyun.com/pypi/simple
966 | ```
967 |
968 | **3. Certain dependencies cannot be installed due to a high pip version**
969 |
970 | On June 21, 2024, pip was updated to version 24.1.
Simply running `pip install --upgrade pip` will update pip to version 24.1. However, some dependencies can only be installed with pip older than 24.1, so the pip version must be downgraded manually. hydra-core, omegaconf, and fastapi are currently known to be affected. The specific error encountered during installation looks like this:
971 |
972 | ```bash
973 | Please use pip<24.1 if you need to use this version.
974 | INFO: pip is looking at multiple versions of hydra-core to determine which version is compatible with other requirements. This could take a while.
975 | ERROR: Cannot install -r requirements.txt (line 20) and fairseq because these package versions have conflicting dependencies.
976 |
977 | The conflict is caused by:
978 |     fairseq 0.12.2 depends on omegaconf<2.1
979 |     hydra-core 1.0.7 depends on omegaconf<2.1 and >=2.0.5
980 |
981 | To fix this you could try to:
982 | 1. loosen the range of package versions you've specified
983 | 2. remove package versions to allow pip to attempt to solve the dependency conflict
984 | ```
985 |
986 | The solution is to limit the pip version before installing dependencies, as described in [1.5 Installation of Other Dependencies](#15-installation-of-other-dependencies). Use the following command to limit the pip version:
987 |
988 | ```bash
989 | pip install --upgrade pip==23.3.2 wheel setuptools
990 | ```
991 |
992 | After running this command, proceed with the installation of the other dependencies.
993 |
994 | ## Common Errors During Dataset Preprocessing and Model Training
995 |
996 | **1. Error: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position xx`**
997 |
998 | - Ensure that dataset filenames do not contain non-Western characters such as Chinese or Japanese, especially Chinese punctuation marks like brackets, commas, colons, semicolons, quotes, etc. After renaming, **reprocess the dataset** and then proceed with training.
999 |
1000 | **2. Error: `The expand size of the tensor (768) must match the existing size (256) at non-singleton dimension 0.`**
1001 |
1002 | - Delete all contents under `dataset/44k` and redo the preprocessing steps.
1003 |
1004 | **3. Error: `RuntimeError: DataLoader worker (pid(s) 13920) exited unexpectedly`**
1005 |
1006 | ```bash
1007 | raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
1008 | RuntimeError: DataLoader worker (pid(s) 13920) exited unexpectedly
1009 | ```
1010 |
1011 | - Reduce the `batchsize` value, increase virtual memory, and restart the computer to free GPU memory, until you find a `batchsize` and virtual memory setting that no longer causes errors.
1012 |
1013 | **4. Error: `torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 3221225477`**
1014 |
1015 | - Increase virtual memory and reduce the `batchsize` value until the settings no longer cause errors.
1016 |
1017 | **5. Error: `AssertionError: CPU training is not allowed.`**
1018 |
1019 | - **No solution:** training is not supported without an NVIDIA GPU. For beginners, the short answer is that training without an NVIDIA GPU is not feasible.
1020 |
1021 | **6. Error: `FileNotFoundError: No such file or directory: 'pretrain/rmvpe.pt'`**
1022 |
1023 | - If you run `python preprocess_hubert_f0.py --f0_predictor rmvpe --use_diff` and encounter `FileNotFoundError: No such file or directory: 'pretrain/rmvpe.pt'`, this is because the official documentation has been updated to use the rmvpe preprocessor for F0 processing.
Refer to the tutorial documentation [#2.2.3](#223-optional-items-choose-as-needed) to download the preprocessing model `rmvpe.pt` and place it in the corresponding directory.
1024 |
1025 | **7. Error: "Page file is too small to complete the operation."**
1026 |
1027 | - **Solution:** Increase the virtual memory. You can find detailed instructions online for your specific operating system.
1028 |
1029 | ## Errors When Using WebUI
1030 |
1031 | **1. Errors When Starting or Loading Models in WebUI**
1032 |
1033 | - **Error When Starting WebUI:** `ImportError: cannot import name 'Schema' from 'pydantic'`
1034 | - **Error When Loading Models in WebUI:** `AttributeError("'Dropdown' object has no attribute 'update'")`
1035 | - **Errors Related to Dependencies:** any error that involves `fastapi`, `gradio`, or `pydantic`.
1036 |
1037 | **Solution:**
1038 |
1039 | - Some dependencies need specific versions. After installing `requirements_win.txt`, enter the following commands in the command prompt to update the packages:
1040 |
1041 | ```bash
1042 | pip install --upgrade fastapi==0.84.0
1043 | pip install --upgrade gradio==3.41.2
1044 | pip install --upgrade pydantic==1.10.12
1045 | ```
1046 |
1047 | **2. Error: `Given groups=1, weight of size [xxx, 256, xxx], expected input[xxx, 768, xxx] to have 256 channels, but got 768 channels instead`**
1048 |
1049 | or **Error: Encoder and model dimensions do not match in the configuration file**
1050 |
1051 | - **Cause:** A v1 branch model is using a `vec768` configuration file, or vice versa.
1052 | - **Solution:** Check the `ssl_dim` setting in your configuration file. If `ssl_dim` is 256, the `speech_encoder` should be `vec256|9`. If it is 768, it should be `vec768|12`.
1053 | - For detailed instructions, refer to [#2.1](#21-issues-regarding-compatibility-with-the-40-model).
1054 |
1055 | **3. Error: `'HParams' object has no attribute 'xxx'`**
1056 |
1057 | - **Cause:** Usually, this indicates that the speaker (timbre) cannot be found because the configuration file does not match the model.
1058 | - **Solution:** Open the configuration file and scroll to the bottom to check whether it includes the timbre you trained.
1059 |
1060 | # Acknowledgements
1061 |
1062 | We would like to extend our heartfelt thanks to the following individuals and organizations whose contributions and resources have made this project possible:
1063 |
1064 | - **so-vits-svc** | [so-vits-svc GitHub Repository](https://github.com/svc-develop-team/so-vits-svc)
1065 | - **GPT-SoVITS** | [GPT-SoVITS GitHub Repository](https://github.com/RVC-Boss/GPT-SoVITS)
1066 | - **SingingVocoders** | [SingingVocoders GitHub Repository](https://github.com/openvpi/SingingVocoders)
1067 | - **MoeVoiceStudio** | [MoeVoiceStudio GitHub Repository](https://github.com/NaruseMioShirakana/MoeVoiceStudio)
1068 | - **OpenVPI** | [OpenVPI GitHub Organization](https://github.com/openvpi) | [Vocoders GitHub Repository](https://github.com/openvpi/vocoders)
1069 | - **Bilibili uploader [infinite_loop]** | [Bilibili Profile](https://space.bilibili.com/286311429) | [Related Video](https://www.bilibili.com/video/BV1Bd4y1W7BN) | [Related Column](https://www.bilibili.com/read/cv21425662)
1070 | - **Bilibili uploader [羽毛布団]** | [Bilibili Profile](https://space.bilibili.com/3493141443250876) | [Error Resolution Guide](https://www.bilibili.com/read/cv22206231) | [Common Error Solutions](https://www.yuque.com/umoubuton/ueupp5/ieinf8qmpzswpsvr)
1071 | - **All Contributors of Training Audio Samples**
1072 | - **You** - For your interest, support, and contributions.
1073 | --------------------------------------------------------------------------------