├── .gitignore ├── Changelog_CN.md ├── LICENSE ├── README.md ├── README_en.md ├── RVC改进意见.txt ├── Retrieval_based_Voice_Conversion_WebUI.ipynb ├── config.py ├── configs ├── 32k.json ├── 40k.json └── 48k.json ├── envfilescheck.bat ├── export_onnx.py ├── extract_f0_print.py ├── extract_feature_print.py ├── go-web_jp.bat ├── gui.py ├── infer-web.py ├── infer-web_jp.py ├── infer ├── infer-pm-index256.py ├── train-index.py └── trans_weights.py ├── infer_pack ├── attentions.py ├── commons.py ├── models.py ├── models_onnx.py ├── modules.py └── transforms.py ├── infer_uvr5.py ├── logs └── mute │ ├── 0_gt_wavs │ ├── mute32k.wav │ ├── mute40k.wav │ └── mute48k.wav │ ├── 1_16k_wavs │ └── mute.wav │ ├── 2a_f0 │ └── mute.wav.npy │ ├── 2b-f0nsf │ └── mute.wav.npy │ └── 3_feature256 │ └── mute.npy ├── my_utils.py ├── poetry.lock ├── pretrained └── .gitignore ├── pyproject.toml ├── requirements.txt ├── slicer2.py ├── train ├── cmd.txt ├── data_utils.py ├── losses.py ├── mel_processing.py ├── process_ckpt.py └── utils.py ├── train_nsf_sim_cache_sid_load_pretrain.py ├── trainset_preprocess_pipeline_print.py ├── uvr5_pack ├── lib_v5 │ ├── dataset.py │ ├── layers.py │ ├── layers_123812KB .py │ ├── layers_123821KB.py │ ├── layers_33966KB.py │ ├── layers_537227KB.py │ ├── layers_537238KB.py │ ├── model_param_init.py │ ├── modelparams │ │ ├── 1band_sr16000_hl512.json │ │ ├── 1band_sr32000_hl512.json │ │ ├── 1band_sr33075_hl384.json │ │ ├── 1band_sr44100_hl1024.json │ │ ├── 1band_sr44100_hl256.json │ │ ├── 1band_sr44100_hl512.json │ │ ├── 1band_sr44100_hl512_cut.json │ │ ├── 2band_32000.json │ │ ├── 2band_44100_lofi.json │ │ ├── 2band_48000.json │ │ ├── 3band_44100.json │ │ ├── 3band_44100_mid.json │ │ ├── 3band_44100_msb2.json │ │ ├── 4band_44100.json │ │ ├── 4band_44100_mid.json │ │ ├── 4band_44100_msb.json │ │ ├── 4band_44100_msb2.json │ │ ├── 4band_44100_reverse.json │ │ ├── 4band_44100_sw.json │ │ ├── 4band_v2.json │ │ ├── 4band_v2_sn.json │ │ └── ensemble.json │ ├── nets.py │ ├── nets_123812KB.py │ ├── nets_123821KB.py │ ├── nets_33966KB.py │ ├── nets_537227KB.py │ ├── nets_537238KB.py │ ├── nets_61968KB.py │ └── spec_utils.py └── utils.py ├── uvr5_weights └── .gitignore ├── vc_infer_pipeline.py ├── weights └── .gitignore ├── 使用需遵守的协议-LICENSE.txt └── 小白简易教程.doc /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | __pycache__ 3 | /TEMP 4 | *.pyd 5 | hubert_base.pt 6 | /logs 7 | -------------------------------------------------------------------------------- /Changelog_CN.md: -------------------------------------------------------------------------------- 1 | 20230409 2 | 3 | 修正训练参数,提升显卡平均利用率,A100最高从25%提升至90%左右,V100:50%->90%左右,2060S:60%->85%左右,P40:25%->95%左右,训练速度显著提升 4 | 5 | 修正参数:总batch_size改为每张卡的batch_size 6 | 7 | 修正total_epoch:最大限制100解锁至1000;默认10提升至默认20 8 | 9 | 修复ckpt提取识别是否带音高错误导致推理异常的问题 10 | 11 | 修复分布式训练每个rank都保存一次ckpt的问题 12 | 13 | 特征提取进行nan特征过滤 14 | 15 | 修复静音输入输出随机辅音or噪声的问题(老版模型需要重做训练集重训) 16 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 liujing04 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of 
the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |
2 | 
3 | # Retrieval-based-Voice-Conversion-WebUI
4 | 一个基于VITS的简单易用的语音转换(变声器)框架
5 | 6 | [![madewithlove](https://forthebadge.com/images/badges/built-with-love.svg)](https://github.com/liujing04/Retrieval-based-Voice-Conversion-WebUI) 7 | 8 |
9 | 10 | [![Open In Colab](https://img.shields.io/badge/Colab-F9AB00?style=for-the-badge&logo=googlecolab&color=525252)](https://colab.research.google.com/github/liujing04/Retrieval-based-Voice-Conversion-WebUI/blob/main/Retrieval_based_Voice_Conversion_WebUI.ipynb) 11 | [![Licence](https://img.shields.io/github/license/liujing04/Retrieval-based-Voice-Conversion-WebUI?style=for-the-badge)](https://github.com/liujing04/Retrieval-based-Voice-Conversion-WebUI/blob/main/%E4%BD%BF%E7%94%A8%E9%9C%80%E9%81%B5%E5%AE%88%E7%9A%84%E5%8D%8F%E8%AE%AE-LICENSE.txt) 12 | [![Huggingface](https://img.shields.io/badge/🤗%20-Spaces-blue.svg?style=for-the-badge)](https://huggingface.co/lj1995/VoiceConversionWebUI/tree/main/) 13 | 14 |
15 | 16 | ------ 17 | 18 | [**更新日志**](https://github.com/liujing04/Retrieval-based-Voice-Conversion-WebUI/blob/main/Changelog_CN.md) 19 | 20 | [**English**](./README_en.md) | [**中文简体**](./README.md) 21 | 22 | > 点此查看我们的[演示视频](https://www.bilibili.com/video/BV1pm4y1z7Gm/) ! 23 | 24 | > 使用了RVC的实时语音转换: [w-okada/voice-changer](https://github.com/w-okada/voice-changer) 25 | 26 | ## 简介 27 | 本仓库具有以下特点: 28 | + 使用top1特征模型检索来杜绝音色泄漏; 29 | + 即便在相对较差的显卡上也能快速训练; 30 | + 使用少量数据进行训练也能得到较好结果; 31 | + 可以通过模型融合来改变音色; 32 | + 简单易用的WebUI界面; 33 | + 可调用UVR5模型来快速分离人声和伴奏。 34 | + 底模训练集使用接近50小时的高质量VCTK开源,后续会陆续加入高质量有授权歌声训练集供大家放心使用。 35 | ## 环境配置 36 | 我们推荐你使用poetry来配置环境。 37 | 38 | 以下指令需在Python版本大于3.8的环境当中执行: 39 | ```bash 40 | # 安装Pytorch及其核心依赖,若已安装则跳过 41 | # 参考自: https://pytorch.org/get-started/locally/ 42 | pip install torch torchvision torchaudio 43 | 44 | 如果是win系统+30系显卡,根据https://github.com/liujing04/Retrieval-based-Voice-Conversion-WebUI/issues/21的经验,需要指定pytorch对应的cuda版本 45 | 46 | pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117 47 | 48 | # 安装 Poetry 依赖管理工具, 若已安装则跳过 49 | # 参考自: https://python-poetry.org/docs/#installation 50 | curl -sSL https://install.python-poetry.org | python3 - 51 | 52 | # 通过poetry安装依赖 53 | poetry install 54 | ``` 55 | 56 | 你也可以通过pip来安装依赖: 57 | 58 | **注意**: `MacOS`下`faiss 1.7.2`版本会导致抛出段错误,请将`requirements.txt`的对应条目改为`faiss-cpu==1.7.0` 59 | 60 | ```bash 61 | pip install -r requirements.txt 62 | ``` 63 | 64 | ## 其他预模型准备 65 | RVC需要其他的一些预模型来推理和训练。 66 | 67 | 你可以从我们的[Huggingface space](https://huggingface.co/lj1995/VoiceConversionWebUI/tree/main/)下载到这些模型。 68 | 69 | 以下是一份清单,包括了所有RVC所需的预模型和其他文件的名称: 70 | ```bash 71 | hubert_base.pt 72 | 73 | ./pretrained 74 | 75 | ./uvr5_weights 76 | 77 | #如果你正在使用Windows,则你可能需要这个文件夹,若FFmpeg已安装则跳过 78 | ./ffmpeg 79 | ``` 80 | 之后使用以下指令来调用Webui: 81 | ```bash 82 | python infer-web.py 83 | ``` 84 | 如果你正在使用Windows,你可以直接下载并解压`RVC-beta.7z` 来使用RVC,运行`go-web.bat`来启动WebUI。 85 | 86 | 我们将在两周内推出一个英文版本的WebUI. 
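Before launching the WebUI it can be useful to confirm that the pre-models listed above are actually in place. The snippet below is a minimal, hypothetical helper (it is not part of this repository) that only checks for the paths named in the checklist — `hubert_base.pt`, `./pretrained` and `./uvr5_weights`:

```python
# Hypothetical pre-flight check for the files listed in the checklist above.
import os
import sys

required = [
    "hubert_base.pt",   # HuBERT feature extractor weights
    "pretrained",       # base generator/discriminator models (D*/G*/f0D*/f0G*.pth)
    "uvr5_weights",     # UVR5 vocal/instrumental separation models
]

missing = [p for p in required if not os.path.exists(p)]
if missing:
    print("missing pre-model paths:", ", ".join(missing))
    sys.exit(1)
print("all pre-model paths found, you can start the WebUI with: python infer-web.py")
```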
87 | 88 | 仓库内还有一份`小白简易教程.doc`以供参考。 89 | 90 | ## 参考项目 91 | + [ContentVec](https://github.com/auspicious3000/contentvec/) 92 | + [VITS](https://github.com/jaywalnut310/vits) 93 | + [HIFIGAN](https://github.com/jik876/hifi-gan) 94 | + [Gradio](https://github.com/gradio-app/gradio) 95 | + [FFmpeg](https://github.com/FFmpeg/FFmpeg) 96 | + [Ultimate Vocal Remover](https://github.com/Anjok07/ultimatevocalremovergui) 97 | + [audio-slicer](https://github.com/openvpi/audio-slicer) 98 | ## 感谢所有贡献者作出的努力 99 | 100 | 101 | 102 | 103 | -------------------------------------------------------------------------------- /README_en.md: -------------------------------------------------------------------------------- 1 | # Retrieval-based-Voice-Conversion-WebUI 2 | 3 | [![madewithlove](https://forthebadge.com/images/badges/built-with-love.svg)](https://github.com/liujing04/Retrieval-based-Voice-Conversion-WebUI) 4 | 5 | [![Open In Colab](https://img.shields.io/badge/Colab-F9AB00?style=for-the-badge&logo=googlecolab&color=525252)](https://colab.research.google.com/github/liujing04/Retrieval-based-Voice-Conversion-WebUI/blob/main/Retrieval_based_Voice_Conversion_WebUI.ipynb) 6 | [![Licence](https://img.shields.io/github/license/liujing04/Retrieval-based-Voice-Conversion-WebUI?style=for-the-badge)](https://github.com/liujing04/Retrieval-based-Voice-Conversion-WebUI/blob/main/%E4%BD%BF%E7%94%A8%E9%9C%80%E9%81%B5%E5%AE%88%E7%9A%84%E5%8D%8F%E8%AE%AE-LICENSE.txt) 7 | [![Huggingface](https://img.shields.io/badge/🤗%20-Spaces-blue.svg?style=for-the-badge)](https://huggingface.co/lj1995/VoiceConversionWebUI/tree/main/) 8 | 9 | ### Realtime Voice Conversion Software using RVC : [w-okada/voice-changer](https://github.com/w-okada/voice-changer) 10 | ------ 11 | 12 | An easy-to-use SVC framework based on VITS. 13 | 14 | [**English**](./README.md) | [**中文简体**](./README_zh_CN.md) 15 | 16 | > Check our [Demo Video](https://www.bilibili.com/video/BV1pm4y1z7Gm/) here! 17 | ## Summary 18 | This repository has the following features: 19 | + Using top1 feature model retrieval to reduce tone leakage; 20 | + Easy and fast training, even on relatively poor graphics cards; 21 | + Training with a small amount of data also obtains relatively good results; 22 | + Supporting model fusion to change timbres; 23 | + Easy-to-use Webui interface; 24 | + Use the UVR5 model to quickly separate vocals and instruments. 25 | + The dataset for the pre-training model uses nearly 50 hours of high quality VCTK open source, and high quality licensed song datasets will be added one after another for your use, without worrying about copyright infringement. 26 | ## Preparing the environment 27 | We recommend you install the dependencies through poetry. 
28 | 29 | The following commands need to be executed in the environment of Python version 3.8 or higher: 30 | ```bash 31 | # Install PyTorch-related core dependencies, skip if installed 32 | # Reference: https://pytorch.org/get-started/locally/ 33 | pip install torch torchvision torchaudio 34 | 35 | #For Windows + 30-series Nvidia cards, you need to specify the cuda version corresponding to pytorch according to the experience of https://github.com/liujing04/Retrieval-based-Voice-Conversion-WebUI/issues/21 36 | 37 | pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117 38 | 39 | # Install the Poetry dependency management tool, skip if installed 40 | # Reference: https://python-poetry.org/docs/#installation 41 | curl -sSL https://install.python-poetry.org | python3 - 42 | 43 | # Install the project dependencies 44 | poetry install 45 | ``` 46 | You can also use pip to install the dependencies 47 | 48 | **Notice**: `faiss 1.7.2` will raise Segmentation Fault: 11 under `MacOS`, please change corresponding line in `requirements.txt` to `faiss-cpu==1.7.0` 49 | 50 | ```bash 51 | pip install -r requirements.txt 52 | ``` 53 | 54 | ## Preparation of other Pre-models 55 | RVC requires other pre-models to infer and train. 56 | 57 | You need to download them from our [Huggingface space](https://huggingface.co/lj1995/VoiceConversionWebUI/tree/main/). 58 | 59 | Here's a list of Pre-models and other files that RVC needs: 60 | ```bash 61 | hubert_base.pt 62 | 63 | ./pretrained 64 | 65 | ./uvr5_weights 66 | 67 | #If you are using Windows, you may also need this dictionary, skip if FFmpeg is installed 68 | ffmpeg.exe 69 | ``` 70 | Then use this command to start Webui: 71 | ```bash 72 | python infer-web.py 73 | ``` 74 | If you are using Windows, you can download and extract `RVC-beta.7z` to use RVC directly and use `go-web.bat` to start Webui. 75 | 76 | We will develop an English version of the WebUI in 2 weeks. 77 | 78 | There's also a tutorial on RVC in Chinese and you can check it out if needed. 79 | 80 | ## Credits 81 | 82 | ## Thanks to all contributors for their efforts 83 | 84 | 85 | 86 | 87 | 88 | -------------------------------------------------------------------------------- /RVC改进意见.txt: -------------------------------------------------------------------------------- 1 | ToDo: 2 | 3 | 停车按钮 4 | 5 | 根据每E时间推测训练剩余时间 6 | 7 | 记录点Demo: 8 | 推理时可以选择哪些记录点,然后批量自动推理出demo以便对比节点过拟合和欠拟合情况 9 | 训练时可以自动推理每个保存节点的Demo便于实时听过拟合和欠拟合[可单独选择一张推理用卡] 10 | 11 | 训练队列: 12 | 可以队列训练列表,训练结束后自动进行下一个训练 13 | 14 | 配置文件保存: 15 | WebUI的预设可以保存为配置文件,下次启动时自动读取 16 | 17 | 推理自动选择特征库检索文件 18 | 19 | Epoch和保存频率、Batch size等可以从滑条改为一个纵向的输入数字的配置面板 20 | 21 | WebUI可以重新布局? 详情参考目录下的WebUI_参考(目前尚未建立) 22 | 23 | 模型推理可以做成单次拖拽类的 24 | 25 | 26 | 个人的小想法: 27 | 可以试着接入一些类似于Vocaloid的工程文件来读取F0音高曲线? 28 | 比如SV,ACE,Vocaloid,Cevio Studio这种歌声合成软件 29 | 然后再给到f0编辑器(如果有了) 30 | 31 | 能暴露接口然后可以用QT做个桌面程序?毕竟QT也是跨平台的 32 | 可以给到一个端口,让他们在云端跑,本地跑这个QT程序桌面程序来控制云端的训练和推理? 
33 | 34 | 35 | 36 | IsDo: -------------------------------------------------------------------------------- /config.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | parser = argparse.ArgumentParser() 3 | parser.add_argument("--port", type=int, default=7865, help="Listen port") 4 | parser.add_argument("--pycmd", type=str, default="python", help="Python command") 5 | parser.add_argument("--colab", action='store_true', help="Launch in colab") 6 | parser.add_argument("--noparallel", action='store_true', help="Disable parallel processing") 7 | cmd_opts = parser.parse_args() 8 | ############离线VC参数 9 | inp_root=r"白鹭霜华长条"#对输入目录下所有音频进行转换,别放非音频文件 10 | opt_root=r"opt"#输出目录 11 | f0_up_key=0#升降调,整数,男转女12,女转男-12 12 | person=r"weights\洛天依v3.pt"#目前只有洛天依v3 13 | ############硬件参数 14 | device = "cuda:0"#填写cuda:x或cpu,x指代第几张卡,只支持N卡加速 15 | is_half=True#9-10-20-30-40系显卡无脑True,不影响质量,>=20显卡开启有加速 16 | n_cpu=0#默认0用上所有线程,写数字限制CPU资源使用 17 | ############python命令路径 18 | python_cmd=cmd_opts.pycmd 19 | listen_port=cmd_opts.port 20 | iscolab=cmd_opts.colab 21 | noparallel=cmd_opts.noparallel 22 | ############下头别动 23 | import torch 24 | if(torch.cuda.is_available()==False): 25 | print("没有发现支持的N卡, 使用CPU进行推理") 26 | device="cpu" 27 | is_half=False 28 | if(device!="cpu"): 29 | gpu_name=torch.cuda.get_device_name(int(device.split(":")[-1])) 30 | if("16"in gpu_name or "MX"in gpu_name): 31 | print("16系显卡/MX系显卡强制单精度") 32 | is_half=False 33 | from multiprocessing import cpu_count 34 | if(n_cpu==0):n_cpu=cpu_count() 35 | if(is_half==True): 36 | #6G显存配置 37 | x_pad = 3 38 | x_query = 10 39 | x_center = 60 40 | x_max = 65 41 | else: 42 | #5G显存配置 43 | x_pad = 1 44 | # x_query = 6 45 | # x_center = 30 46 | # x_max = 32 47 | #6G显存配置 48 | x_query = 6 49 | x_center = 38 50 | x_max = 41 51 | -------------------------------------------------------------------------------- /configs/32k.json: -------------------------------------------------------------------------------- 1 | { 2 | "train": { 3 | "log_interval": 200, 4 | "seed": 1234, 5 | "epochs": 20000, 6 | "learning_rate": 1e-4, 7 | "betas": [0.8, 0.99], 8 | "eps": 1e-9, 9 | "batch_size": 4, 10 | "fp16_run": true, 11 | "lr_decay": 0.999875, 12 | "segment_size": 12800, 13 | "init_lr_ratio": 1, 14 | "warmup_epochs": 0, 15 | "c_mel": 45, 16 | "c_kl": 1.0 17 | }, 18 | "data": { 19 | "max_wav_value": 32768.0, 20 | "sampling_rate": 32000, 21 | "filter_length": 1024, 22 | "hop_length": 320, 23 | "win_length": 1024, 24 | "n_mel_channels": 80, 25 | "mel_fmin": 0.0, 26 | "mel_fmax": null 27 | }, 28 | "model": { 29 | "inter_channels": 192, 30 | "hidden_channels": 192, 31 | "filter_channels": 768, 32 | "n_heads": 2, 33 | "n_layers": 6, 34 | "kernel_size": 3, 35 | "p_dropout": 0, 36 | "resblock": "1", 37 | "resblock_kernel_sizes": [3,7,11], 38 | "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]], 39 | "upsample_rates": [10,4,2,2,2], 40 | "upsample_initial_channel": 512, 41 | "upsample_kernel_sizes": [16,16,4,4,4], 42 | "use_spectral_norm": false, 43 | "gin_channels": 256, 44 | "spk_embed_dim": 109 45 | } 46 | } 47 | -------------------------------------------------------------------------------- /configs/40k.json: -------------------------------------------------------------------------------- 1 | { 2 | "train": { 3 | "log_interval": 200, 4 | "seed": 1234, 5 | "epochs": 20000, 6 | "learning_rate": 1e-4, 7 | "betas": [0.8, 0.99], 8 | "eps": 1e-9, 9 | "batch_size": 4, 10 | "fp16_run": true, 11 | "lr_decay": 0.999875, 12 | "segment_size": 12800, 13 
| "init_lr_ratio": 1, 14 | "warmup_epochs": 0, 15 | "c_mel": 45, 16 | "c_kl": 1.0 17 | }, 18 | "data": { 19 | "max_wav_value": 32768.0, 20 | "sampling_rate": 40000, 21 | "filter_length": 2048, 22 | "hop_length": 400, 23 | "win_length": 2048, 24 | "n_mel_channels": 125, 25 | "mel_fmin": 0.0, 26 | "mel_fmax": null 27 | }, 28 | "model": { 29 | "inter_channels": 192, 30 | "hidden_channels": 192, 31 | "filter_channels": 768, 32 | "n_heads": 2, 33 | "n_layers": 6, 34 | "kernel_size": 3, 35 | "p_dropout": 0, 36 | "resblock": "1", 37 | "resblock_kernel_sizes": [3,7,11], 38 | "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]], 39 | "upsample_rates": [10,10,2,2], 40 | "upsample_initial_channel": 512, 41 | "upsample_kernel_sizes": [16,16,4,4], 42 | "use_spectral_norm": false, 43 | "gin_channels": 256, 44 | "spk_embed_dim": 109 45 | } 46 | } 47 | -------------------------------------------------------------------------------- /configs/48k.json: -------------------------------------------------------------------------------- 1 | { 2 | "train": { 3 | "log_interval": 200, 4 | "seed": 1234, 5 | "epochs": 20000, 6 | "learning_rate": 1e-4, 7 | "betas": [0.8, 0.99], 8 | "eps": 1e-9, 9 | "batch_size": 4, 10 | "fp16_run": true, 11 | "lr_decay": 0.999875, 12 | "segment_size": 11520, 13 | "init_lr_ratio": 1, 14 | "warmup_epochs": 0, 15 | "c_mel": 45, 16 | "c_kl": 1.0 17 | }, 18 | "data": { 19 | "max_wav_value": 32768.0, 20 | "sampling_rate": 48000, 21 | "filter_length": 2048, 22 | "hop_length": 480, 23 | "win_length": 2048, 24 | "n_mel_channels": 128, 25 | "mel_fmin": 0.0, 26 | "mel_fmax": null 27 | }, 28 | "model": { 29 | "inter_channels": 192, 30 | "hidden_channels": 192, 31 | "filter_channels": 768, 32 | "n_heads": 2, 33 | "n_layers": 6, 34 | "kernel_size": 3, 35 | "p_dropout": 0, 36 | "resblock": "1", 37 | "resblock_kernel_sizes": [3,7,11], 38 | "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]], 39 | "upsample_rates": [10,6,2,2,2], 40 | "upsample_initial_channel": 512, 41 | "upsample_kernel_sizes": [16,16,4,4,4], 42 | "use_spectral_norm": false, 43 | "gin_channels": 256, 44 | "spk_embed_dim": 109 45 | } 46 | } 47 | -------------------------------------------------------------------------------- /envfilescheck.bat: -------------------------------------------------------------------------------- 1 | @echo off && chcp 65001 2 | 3 | echo working dir is %cd% 4 | echo downloading requirement aria2 check. 5 | echo= 6 | dir /a:d/b | findstr "aria2" > flag.txt 7 | findstr "aria2" flag.txt >nul 8 | if %errorlevel% ==0 ( 9 | echo aria2 checked. 10 | echo= 11 | ) else ( 12 | echo failed. please downloading aria2 from webpage! 13 | echo unzip it and put in this directory! 14 | timeout /T 5 15 | start https://github.com/aria2/aria2/releases/tag/release-1.36.0 16 | echo= 17 | goto end 18 | ) 19 | 20 | echo envfiles checking start. 
21 | echo= 22 | 23 | for /f %%x in ('findstr /i /c:"aria2" "flag.txt"') do (set aria2=%%x)&goto endSch 24 | :endSch 25 | 26 | set d32=f0D32k.pth 27 | set d40=f0D40k.pth 28 | set d48=f0D48k.pth 29 | set g32=f0G32k.pth 30 | set g40=f0G40k.pth 31 | set g48=f0G48k.pth 32 | 33 | set dld32=https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/f0D32k.pth 34 | set dld40=https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/f0D40k.pth 35 | set dld48=https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/f0D48k.pth 36 | set dlg32=https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/f0G32k.pth 37 | set dlg40=https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/f0G40k.pth 38 | set dlg48=https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/f0G48k.pth 39 | 40 | set hp2=HP2-人声vocals+非人声instrumentals.pth 41 | set hp5=HP5-主旋律人声vocals+其他instrumentals.pth 42 | 43 | set dlhp2=https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/uvr5_weights/HP2-人声vocals+非人声instrumentals.pth 44 | set dlhp5=https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/uvr5_weights/HP5-主旋律人声vocals+其他instrumentals.pth 45 | 46 | set hb=hubert_base.pt 47 | 48 | set dlhb=https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/hubert_base.pt 49 | 50 | echo dir check start. 51 | echo= 52 | 53 | if exist "%~dp0pretrained" ( 54 | echo dir .\pretrained checked. 55 | ) else ( 56 | echo failed. generating dir .\pretrained. 57 | mkdir pretrained 58 | ) 59 | if exist "%~dp0uvr5_weights" ( 60 | echo dir .\uvr5_weights checked. 61 | ) else ( 62 | echo failed. generating dir .\uvr5_weights. 63 | mkdir uvr5_weights 64 | ) 65 | 66 | echo= 67 | echo dir check finished. 68 | 69 | echo= 70 | echo required files check start. 71 | 72 | echo checking D32k.pth 73 | if exist "%~dp0pretrained\D32k.pth" ( 74 | echo D32k.pth in .\pretrained checked. 75 | echo= 76 | ) else ( 77 | echo failed. starting download from huggingface. 78 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/D32k.pth -d %~dp0pretrained -o D32k.pth 79 | if exist "%~dp0pretrained\D32k.pth" (echo download successful.) else (echo please try again! 80 | echo=) 81 | ) 82 | echo checking D40k.pth 83 | if exist "%~dp0pretrained\D40k.pth" ( 84 | echo D40k.pth in .\pretrained checked. 85 | echo= 86 | ) else ( 87 | echo failed. starting download from huggingface. 88 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/D40k.pth -d %~dp0pretrained -o D40k.pth 89 | if exist "%~dp0pretrained\D40k.pth" (echo download successful.) else (echo please try again! 90 | echo=) 91 | ) 92 | echo checking D48k.pth 93 | if exist "%~dp0pretrained\D48k.pth" ( 94 | echo D48k.pth in .\pretrained checked. 95 | echo= 96 | ) else ( 97 | echo failed. starting download from huggingface. 98 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/D48k.pth -d %~dp0pretrained -o D48k.pth 99 | if exist "%~dp0pretrained\D48k.pth" (echo download successful.) else (echo please try again! 100 | echo=) 101 | ) 102 | echo checking G32k.pth 103 | if exist "%~dp0pretrained\G32k.pth" ( 104 | echo G32k.pth in .\pretrained checked. 105 | echo= 106 | ) else ( 107 | echo failed. starting download from huggingface. 
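rem aria2c invocation used throughout this script: -c resumes an interrupted download,
rem -x 16 / -s 16 allow up to 16 connections and splits per file, -k 1M sets the minimum
rem split size, -d selects the output directory and -o the output filename.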
108 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/G32k.pth -d %~dp0pretrained -o G32k.pth 109 | if exist "%~dp0pretrained\G32k.pth" (echo download successful.) else (echo please try again! 110 | echo=) 111 | ) 112 | echo checking G40k.pth 113 | if exist "%~dp0pretrained\G40k.pth" ( 114 | echo G40k.pth in .\pretrained checked. 115 | echo= 116 | ) else ( 117 | echo failed. starting download from huggingface. 118 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/G40k.pth -d %~dp0pretrained -o G40k.pth 119 | if exist "%~dp0pretrained\G40k.pth" (echo download successful.) else (echo please try again! 120 | echo=) 121 | ) 122 | echo checking G48k.pth 123 | if exist "%~dp0pretrained\G48k.pth" ( 124 | echo G48k.pth in .\pretrained checked. 125 | echo= 126 | ) else ( 127 | echo failed. starting download from huggingface. 128 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/G48k.pth -d %~dp0pretrained -o G48k.pth 129 | if exist "%~dp0pretrained\G48k.pth" (echo download successful.) else (echo please try again! 130 | echo=) 131 | ) 132 | 133 | echo checking %d32% 134 | if exist "%~dp0pretrained\%d32%" ( 135 | echo %d32% in .\pretrained checked. 136 | echo= 137 | ) else ( 138 | echo failed. starting download from huggingface. 139 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M %dld32% -d %~dp0pretrained -o %d32% 140 | if exist "%~dp0pretrained\%d32%" (echo download successful.) else (echo please try again! 141 | echo=) 142 | ) 143 | echo checking %d40% 144 | if exist "%~dp0pretrained\%d40%" ( 145 | echo %d40% in .\pretrained checked. 146 | echo= 147 | ) else ( 148 | echo failed. starting download from huggingface. 149 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M %dld40% -d %~dp0pretrained -o %d40% 150 | if exist "%~dp0pretrained\%d40%" (echo download successful.) else (echo please try again! 151 | echo=) 152 | ) 153 | echo checking %d48% 154 | if exist "%~dp0pretrained\%d48%" ( 155 | echo %d48% in .\pretrained checked. 156 | echo= 157 | ) else ( 158 | echo failed. starting download from huggingface. 159 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M %dld48% -d %~dp0pretrained -o %d48% 160 | if exist "%~dp0pretrained\%d48%" (echo download successful.) else (echo please try again! 161 | echo=) 162 | ) 163 | echo checking %g32% 164 | if exist "%~dp0pretrained\%g32%" ( 165 | echo %g32% in .\pretrained checked. 166 | echo= 167 | ) else ( 168 | echo failed. starting download from huggingface. 169 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M %dlg32% -d %~dp0pretrained -o %g32% 170 | if exist "%~dp0pretrained\%g32%" (echo download successful.) else (echo please try again! 171 | echo=) 172 | ) 173 | echo checking %g40% 174 | if exist "%~dp0pretrained\%g40%" ( 175 | echo %g40% in .\pretrained checked. 176 | echo= 177 | ) else ( 178 | echo failed. starting download from huggingface. 179 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M %dlg40% -d %~dp0pretrained -o %g40% 180 | if exist "%~dp0pretrained\%g40%" (echo download successful.) else (echo please try again! 181 | echo=) 182 | ) 183 | echo checking %g48% 184 | if exist "%~dp0pretrained\%g48%" ( 185 | echo %g48% in .\pretrained checked. 
186 | echo= 187 | ) else ( 188 | echo failed. starting download from huggingface. 189 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M %dlg48% -d %~dp0\pretrained -o %g48% 190 | if exist "%~dp0pretrained\%g48%" (echo download successful.) else (echo please try again! 191 | echo=) 192 | ) 193 | 194 | echo checking %hp2% 195 | if exist "%~dp0uvr5_weights\%hp2%" ( 196 | echo %hp2% in .\uvr5_weights checked. 197 | echo= 198 | ) else ( 199 | echo failed. starting download from huggingface. 200 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M %dlhp2% -d %~dp0\uvr5_weights -o %hp2% 201 | if exist "%~dp0uvr5_weights\%hp2%" (echo download successful.) else (echo please try again! 202 | echo=) 203 | ) 204 | echo checking %hp5% 205 | if exist "%~dp0uvr5_weights\%hp5%" ( 206 | echo %hp5% in .\uvr5_weights checked. 207 | echo= 208 | ) else ( 209 | echo failed. starting download from huggingface. 210 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M %dlhp5% -d %~dp0\uvr5_weights -o %HP5% 211 | if exist "%~dp0uvr5_weights\%hp5%" (echo download successful.) else (echo please try again! 212 | echo=) 213 | ) 214 | 215 | echo checking %hb% 216 | if exist "%~dp0%hb%" ( 217 | echo %hb% in .\pretrained checked. 218 | echo= 219 | ) else ( 220 | echo failed. starting download from huggingface. 221 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M %dlhb% -d %~dp0 -o %hb% 222 | if exist "%~dp0%hb%" (echo download successful.) else (echo please try again! 223 | echo=) 224 | ) 225 | 226 | echo required files check finished. 227 | echo envfiles check complete. 228 | pause 229 | :end 230 | del flag.txt -------------------------------------------------------------------------------- /export_onnx.py: -------------------------------------------------------------------------------- 1 | from infer_pack.models_onnx import SynthesizerTrnMs256NSFsid 2 | import torch 3 | 4 | person = "Shiroha/shiroha.pth" 5 | exported_path = "model.onnx" 6 | 7 | 8 | 9 | cpt = torch.load(person, map_location="cpu") 10 | cpt["config"][-3]=cpt["weight"]["emb_g.weight"].shape[0]#n_spk 11 | print(*cpt["config"]) 12 | net_g = SynthesizerTrnMs256NSFsid(*cpt["config"], is_half=False) 13 | net_g.load_state_dict(cpt["weight"], strict=False) 14 | 15 | test_phone = torch.rand(1, 200, 256) 16 | test_phone_lengths = torch.tensor([200]).long() 17 | test_pitch = torch.randint(size=(1 ,200),low=5,high=255) 18 | test_pitchf = torch.rand(1, 200) 19 | test_ds = torch.LongTensor([0]) 20 | test_rnd = torch.rand(1, 192, 200) 21 | input_names = ["phone", "phone_lengths", "pitch", "pitchf", "ds", "rnd"] 22 | output_names = ["audio", ] 23 | device="cpu" 24 | torch.onnx.export(net_g, 25 | ( 26 | test_phone.to(device), 27 | test_phone_lengths.to(device), 28 | test_pitch.to(device), 29 | test_pitchf.to(device), 30 | test_ds.to(device), 31 | test_rnd.to(device) 32 | ), 33 | exported_path, 34 | dynamic_axes={ 35 | "phone": [1], 36 | "pitch": [1], 37 | "pitchf": [1], 38 | "rnd": [2], 39 | }, 40 | do_constant_folding=False, 41 | opset_version=16, 42 | verbose=False, 43 | input_names=input_names, 44 | output_names=output_names) -------------------------------------------------------------------------------- /extract_f0_print.py: -------------------------------------------------------------------------------- 1 | import os,traceback,sys,parselmouth 2 | import librosa 3 | import pyworld 4 | from scipy.io import wavfile 5 | import numpy as np,logging 6 | 
logging.getLogger('numba').setLevel(logging.WARNING) 7 | from multiprocessing import Process 8 | 9 | exp_dir = sys.argv[1] 10 | f = open("%s/extract_f0_feature.log"%exp_dir, "a+") 11 | def printt(strr): 12 | print(strr) 13 | f.write("%s\n" % strr) 14 | f.flush() 15 | 16 | n_p = int(sys.argv[2]) 17 | f0method = sys.argv[3] 18 | 19 | class FeatureInput(object): 20 | def __init__(self, samplerate=16000, hop_size=160): 21 | self.fs = samplerate 22 | self.hop = hop_size 23 | 24 | self.f0_bin = 256 25 | self.f0_max = 1100.0 26 | self.f0_min = 50.0 27 | self.f0_mel_min = 1127 * np.log(1 + self.f0_min / 700) 28 | self.f0_mel_max = 1127 * np.log(1 + self.f0_max / 700) 29 | 30 | def compute_f0(self, path,f0_method): 31 | x, sr = librosa.load(path, self.fs) 32 | p_len=x.shape[0]//self.hop 33 | assert sr == self.fs 34 | if(f0_method=="pm"): 35 | time_step = 160 / 16000 * 1000 36 | f0_min = 50 37 | f0_max = 1100 38 | f0 = parselmouth.Sound(x, sr).to_pitch_ac( 39 | time_step=time_step / 1000, voicing_threshold=0.6, 40 | pitch_floor=f0_min, pitch_ceiling=f0_max).selected_array['frequency'] 41 | pad_size=(p_len - len(f0) + 1) // 2 42 | if(pad_size>0 or p_len - len(f0) - pad_size>0): 43 | f0 = np.pad(f0,[[pad_size,p_len - len(f0) - pad_size]], mode='constant') 44 | elif(f0_method=="harvest"): 45 | f0, t = pyworld.harvest( 46 | x.astype(np.double), 47 | fs=sr, 48 | f0_ceil=1100, 49 | frame_period=1000 * self.hop / sr, 50 | ) 51 | f0 = pyworld.stonemask(x.astype(np.double), f0, t, self.fs) 52 | elif(f0_method=="dio"): 53 | f0, t = pyworld.dio( 54 | x.astype(np.double), 55 | fs=sr, 56 | f0_ceil=1100, 57 | frame_period=1000 * self.hop / sr, 58 | ) 59 | f0 = pyworld.stonemask(x.astype(np.double), f0, t, self.fs) 60 | return f0 61 | 62 | def coarse_f0(self, f0): 63 | f0_mel = 1127 * np.log(1 + f0 / 700) 64 | f0_mel[f0_mel > 0] = (f0_mel[f0_mel > 0] - self.f0_mel_min) * ( 65 | self.f0_bin - 2 66 | ) / (self.f0_mel_max - self.f0_mel_min) + 1 67 | 68 | # use 0 or 1 69 | f0_mel[f0_mel <= 1] = 1 70 | f0_mel[f0_mel > self.f0_bin - 1] = self.f0_bin - 1 71 | f0_coarse = np.rint(f0_mel).astype(np.int) 72 | assert f0_coarse.max() <= 255 and f0_coarse.min() >= 1, ( 73 | f0_coarse.max(), 74 | f0_coarse.min(), 75 | ) 76 | return f0_coarse 77 | 78 | def go(self,paths,f0_method): 79 | if (len(paths) == 0): printt("no-f0-todo") 80 | else: 81 | printt("todo-f0-%s"%len(paths)) 82 | n=max(len(paths)//5,1)#每个进程最多打印5条 83 | for idx,(inp_path,opt_path1,opt_path2) in enumerate(paths): 84 | try: 85 | if(idx%n==0):printt("f0ing,now-%s,all-%s,-%s"%(idx,len(paths),inp_path)) 86 | if(os.path.exists(opt_path1+".npy")==True and os.path.exists(opt_path2+".npy")==True):continue 87 | featur_pit = self.compute_f0(inp_path,f0_method) 88 | np.save(opt_path2,featur_pit,allow_pickle=False,)#nsf 89 | coarse_pit = self.coarse_f0(featur_pit) 90 | np.save(opt_path1,coarse_pit,allow_pickle=False,)#ori 91 | except: 92 | printt("f0fail-%s-%s-%s" % (idx, inp_path,traceback.format_exc())) 93 | 94 | if __name__=='__main__': 95 | # exp_dir=r"E:\codes\py39\dataset\mi-test" 96 | # n_p=16 97 | # f = open("%s/log_extract_f0.log"%exp_dir, "w") 98 | printt(sys.argv) 99 | featureInput = FeatureInput() 100 | paths=[] 101 | inp_root= "%s/1_16k_wavs"%(exp_dir) 102 | opt_root1="%s/2a_f0"%(exp_dir) 103 | opt_root2="%s/2b-f0nsf"%(exp_dir) 104 | 105 | os.makedirs(opt_root1,exist_ok=True) 106 | os.makedirs(opt_root2,exist_ok=True) 107 | for name in sorted(list(os.listdir(inp_root))): 108 | inp_path="%s/%s"%(inp_root,name) 109 | if ("spec" in inp_path): continue 110 | 
opt_path1="%s/%s"%(opt_root1,name) 111 | opt_path2="%s/%s"%(opt_root2,name) 112 | paths.append([inp_path,opt_path1,opt_path2]) 113 | 114 | ps=[] 115 | for i in range(n_p): 116 | p=Process(target=featureInput.go,args=(paths[i::n_p],f0method,)) 117 | p.start() 118 | ps.append(p) 119 | for p in ps: 120 | p.join() 121 | -------------------------------------------------------------------------------- /extract_feature_print.py: -------------------------------------------------------------------------------- 1 | import os,sys,traceback 2 | if len(sys.argv) == 4: 3 | n_part=int(sys.argv[1]) 4 | i_part=int(sys.argv[2]) 5 | exp_dir=sys.argv[3] 6 | else: 7 | n_part=int(sys.argv[1]) 8 | i_part=int(sys.argv[2]) 9 | i_gpu=sys.argv[3] 10 | exp_dir=sys.argv[4] 11 | os.environ["CUDA_VISIBLE_DEVICES"]=str(i_gpu) 12 | 13 | import torch 14 | import torch.nn.functional as F 15 | import soundfile as sf 16 | import numpy as np 17 | from fairseq import checkpoint_utils 18 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 19 | 20 | f = open("%s/extract_f0_feature.log"%exp_dir, "a+") 21 | def printt(strr): 22 | print(strr) 23 | f.write("%s\n" % strr) 24 | f.flush() 25 | printt(sys.argv) 26 | model_path = "hubert_base.pt" 27 | 28 | printt(exp_dir) 29 | wavPath = "%s/1_16k_wavs"%exp_dir 30 | outPath = "%s/3_feature256"%exp_dir 31 | os.makedirs(outPath,exist_ok=True) 32 | # wave must be 16k, hop_size=320 33 | def readwave(wav_path, normalize=False): 34 | wav, sr = sf.read(wav_path) 35 | assert sr == 16000 36 | feats = torch.from_numpy(wav).float() 37 | if feats.dim() == 2: # double channels 38 | feats = feats.mean(-1) 39 | assert feats.dim() == 1, feats.dim() 40 | if normalize: 41 | with torch.no_grad(): 42 | feats = F.layer_norm(feats, feats.shape) 43 | feats = feats.view(1, -1) 44 | return feats 45 | # HuBERT model 46 | printt("load model(s) from {}".format(model_path)) 47 | models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task( 48 | [model_path], 49 | suffix="", 50 | ) 51 | model = models[0] 52 | model = model.to(device) 53 | if torch.cuda.is_available(): 54 | model = model.half() 55 | model.eval() 56 | 57 | todo=sorted(list(os.listdir(wavPath)))[i_part::n_part] 58 | n = max(1,len(todo) // 10) # 最多打印十条 59 | if(len(todo)==0):printt("no-feature-todo") 60 | else: 61 | printt("all-feature-%s"%len(todo)) 62 | for idx,file in enumerate(todo): 63 | try: 64 | if file.endswith(".wav"): 65 | wav_path = "%s/%s"%(wavPath,file) 66 | out_path = "%s/%s"%(outPath,file.replace("wav","npy")) 67 | 68 | if(os.path.exists(out_path)):continue 69 | 70 | feats = readwave(wav_path, normalize=saved_cfg.task.normalize) 71 | padding_mask = torch.BoolTensor(feats.shape).fill_(False) 72 | inputs = { 73 | "source": feats.half().to(device) if torch.cuda.is_available() else feats.to(device), 74 | "padding_mask": padding_mask.to(device), 75 | "output_layer": 9, # layer 9 76 | } 77 | with torch.no_grad(): 78 | logits = model.extract_features(**inputs) 79 | feats = model.final_proj(logits[0]) 80 | 81 | feats = feats.squeeze(0).float().cpu().numpy() 82 | if(np.isnan(feats).sum()==0): 83 | np.save(out_path, feats, allow_pickle=False) 84 | else: 85 | printt("%s-contains nan"%file) 86 | if (idx % n == 0):printt("now-%s,all-%s,%s,%s"%(len(todo),idx,file,feats.shape)) 87 | except: 88 | printt(traceback.format_exc()) 89 | printt("all-feature-done") 90 | -------------------------------------------------------------------------------- /go-web_jp.bat: 
-------------------------------------------------------------------------------- 1 | runtime\python.exe infer-web_jp.py -------------------------------------------------------------------------------- /infer/infer-pm-index256.py: -------------------------------------------------------------------------------- 1 | ''' 2 | 3 | 对源特征进行检索 4 | ''' 5 | import torch, pdb, os,parselmouth 6 | os.environ["CUDA_VISIBLE_DEVICES"]="0" 7 | import numpy as np 8 | import soundfile as sf 9 | # from models import SynthesizerTrn256#hifigan_nonsf 10 | # from infer_pack.models import SynthesizerTrn256NSF as SynthesizerTrn256#hifigan_nsf 11 | from infer_pack.models import SynthesizerTrnMs256NSFsid as SynthesizerTrn256#hifigan_nsf 12 | # from infer_pack.models import SynthesizerTrnMs256NSFsid_sim as SynthesizerTrn256#hifigan_nsf 13 | # from models import SynthesizerTrn256NSFsim as SynthesizerTrn256#hifigan_nsf 14 | # from models import SynthesizerTrn256NSFsimFlow as SynthesizerTrn256#hifigan_nsf 15 | 16 | 17 | from scipy.io import wavfile 18 | from fairseq import checkpoint_utils 19 | # import pyworld 20 | import librosa 21 | import torch.nn.functional as F 22 | import scipy.signal as signal 23 | # import torchcrepe 24 | from time import time as ttime 25 | 26 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 27 | model_path = r"E:\codes\py39\vits_vc_gpu_train\hubert_base.pt"# 28 | print("load model(s) from {}".format(model_path)) 29 | models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task( 30 | [model_path], 31 | suffix="", 32 | ) 33 | model = models[0] 34 | model = model.to(device) 35 | model = model.half() 36 | model.eval() 37 | 38 | # net_g = SynthesizerTrn256(1025,32,192,192,768,2,6,3,0.1,"1", [3,7,11],[[1,3,5], [1,3,5], [1,3,5]],[10,10,2,2],512,[16,16,4,4],183,256,is_half=True)#hifigan#512#256 39 | # net_g = SynthesizerTrn256(1025,32,192,192,768,2,6,3,0.1,"1", [3,7,11],[[1,3,5], [1,3,5], [1,3,5]],[10,10,2,2],512,[16,16,4,4],109,256,is_half=True)#hifigan#512#256 40 | net_g = SynthesizerTrn256(1025,32,192,192,768,2,6,3,0,"1", [3,7,11],[[1,3,5], [1,3,5], [1,3,5]],[10,10,2,2],512,[16,16,4,4],183,256,is_half=True)#hifigan#512#256#no_dropout 41 | # net_g = SynthesizerTrn256(1025,32,192,192,768,2,3,3,0.1,"1", [3,7,11],[[1,3,5], [1,3,5], [1,3,5]],[10,10,2,2],512,[16,16,4,4],0)#ts3 42 | # net_g = SynthesizerTrn256(1025,32,192,192,768,2,6,3,0.1,"1", [3,7,11],[[1,3,5], [1,3,5], [1,3,5]],[10,10,2],512,[16,16,4],0)#hifigan-ps-sr 43 | # 44 | # net_g = SynthesizerTrn(1025, 32, 192, 192, 768, 2, 6, 3, 0.1, "1", [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]], [5,5], 512, [15,15], 0)#ms 45 | # net_g = SynthesizerTrn(1025, 32, 192, 192, 768, 2, 6, 3, 0.1, "1", [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]], [10,10], 512, [16,16], 0)#idwt2 46 | 47 | # weights=torch.load("infer/ft-mi_1k-noD.pt") 48 | # weights=torch.load("infer/ft-mi-freeze-vocoder-flow-enc_q_1k.pt") 49 | # weights=torch.load("infer/ft-mi-freeze-vocoder_true_1k.pt") 50 | # weights=torch.load("infer/ft-mi-sim1k.pt") 51 | weights=torch.load("infer/ft-mi-no_opt-no_dropout.pt") 52 | print(net_g.load_state_dict(weights,strict=True)) 53 | 54 | net_g.eval().to(device) 55 | net_g.half() 56 | def get_f0(x, p_len,f0_up_key=0): 57 | 58 | time_step = 160 / 16000 * 1000 59 | f0_min = 50 60 | f0_max = 1100 61 | f0_mel_min = 1127 * np.log(1 + f0_min / 700) 62 | f0_mel_max = 1127 * np.log(1 + f0_max / 700) 63 | 64 | f0 = parselmouth.Sound(x, 16000).to_pitch_ac( 65 | time_step=time_step / 1000, voicing_threshold=0.6, 66 | pitch_floor=f0_min, 
pitch_ceiling=f0_max).selected_array['frequency'] 67 | 68 | pad_size=(p_len - len(f0) + 1) // 2 69 | if(pad_size>0 or p_len - len(f0) - pad_size>0): 70 | f0 = np.pad(f0,[[pad_size,p_len - len(f0) - pad_size]], mode='constant') 71 | f0 *= pow(2, f0_up_key / 12) 72 | f0bak = f0.copy() 73 | 74 | f0_mel = 1127 * np.log(1 + f0 / 700) 75 | f0_mel[f0_mel > 0] = (f0_mel[f0_mel > 0] - f0_mel_min) * 254 / (f0_mel_max - f0_mel_min) + 1 76 | f0_mel[f0_mel <= 1] = 1 77 | f0_mel[f0_mel > 255] = 255 78 | # f0_mel[f0_mel > 188] = 188 79 | f0_coarse = np.rint(f0_mel).astype(np.int) 80 | return f0_coarse, f0bak 81 | 82 | import faiss 83 | index=faiss.read_index("infer/added_IVF512_Flat_mi_baseline_src_feat.index") 84 | big_npy=np.load("infer/big_src_feature_mi.npy") 85 | ta0=ta1=ta2=0 86 | for idx,name in enumerate(["冬之花clip1.wav",]):## 87 | wav_path = "todo-songs/%s" % name# 88 | f0_up_key=-2# 89 | audio, sampling_rate = sf.read(wav_path) 90 | if len(audio.shape) > 1: 91 | audio = librosa.to_mono(audio.transpose(1, 0)) 92 | if sampling_rate != 16000: 93 | audio = librosa.resample(audio, orig_sr=sampling_rate, target_sr=16000) 94 | 95 | 96 | feats = torch.from_numpy(audio).float() 97 | if feats.dim() == 2: # double channels 98 | feats = feats.mean(-1) 99 | assert feats.dim() == 1, feats.dim() 100 | feats = feats.view(1, -1) 101 | padding_mask = torch.BoolTensor(feats.shape).fill_(False) 102 | inputs = { 103 | "source": feats.half().to(device), 104 | "padding_mask": padding_mask.to(device), 105 | "output_layer": 9, # layer 9 106 | } 107 | if torch.cuda.is_available(): torch.cuda.synchronize() 108 | t0=ttime() 109 | with torch.no_grad(): 110 | logits = model.extract_features(**inputs) 111 | feats = model.final_proj(logits[0]) 112 | 113 | ####索引优化 114 | npy = feats[0].cpu().numpy().astype("float32") 115 | D, I = index.search(npy, 1) 116 | feats = torch.from_numpy(big_npy[I.squeeze()].astype("float16")).unsqueeze(0).to(device) 117 | 118 | feats=F.interpolate(feats.permute(0,2,1),scale_factor=2).permute(0,2,1) 119 | if torch.cuda.is_available(): torch.cuda.synchronize() 120 | t1=ttime() 121 | # p_len = min(feats.shape[1],10000,pitch.shape[0])#太大了爆显存 122 | p_len = min(feats.shape[1],10000)# 123 | pitch, pitchf = get_f0(audio, p_len,f0_up_key) 124 | p_len = min(feats.shape[1],10000,pitch.shape[0])#太大了爆显存 125 | if torch.cuda.is_available(): torch.cuda.synchronize() 126 | t2=ttime() 127 | feats = feats[:,:p_len, :] 128 | pitch = pitch[:p_len] 129 | pitchf = pitchf[:p_len] 130 | p_len = torch.LongTensor([p_len]).to(device) 131 | pitch = torch.LongTensor(pitch).unsqueeze(0).to(device) 132 | sid=torch.LongTensor([0]).to(device) 133 | pitchf = torch.FloatTensor(pitchf).unsqueeze(0).to(device) 134 | with torch.no_grad(): 135 | audio = net_g.infer(feats, p_len,pitch,pitchf,sid)[0][0, 0].data.cpu().float().numpy()#nsf 136 | if torch.cuda.is_available(): torch.cuda.synchronize() 137 | t3=ttime() 138 | ta0+=(t1-t0) 139 | ta1+=(t2-t1) 140 | ta2+=(t3-t2) 141 | # wavfile.write("ft-mi_1k-index256-noD-%s.wav"%name, 40000, audio)## 142 | # wavfile.write("ft-mi-freeze-vocoder-flow-enc_q_1k-%s.wav"%name, 40000, audio)## 143 | # wavfile.write("ft-mi-sim1k-%s.wav"%name, 40000, audio)## 144 | wavfile.write("ft-mi-no_opt-no_dropout-%s.wav"%name, 40000, audio)## 145 | 146 | 147 | print(ta0,ta1,ta2)# 148 | -------------------------------------------------------------------------------- /infer/train-index.py: -------------------------------------------------------------------------------- 1 | ''' 2 | 
格式:直接cid为自带的index位;aid放不下了,通过字典来查,反正就5w个 3 | ''' 4 | import faiss,numpy as np,os 5 | 6 | # ###########如果是原始特征要先写save 7 | inp_root=r"E:\codes\py39\dataset\mi\2-co256" 8 | npys=[] 9 | for name in sorted(list(os.listdir(inp_root))): 10 | phone=np.load("%s/%s"%(inp_root,name)) 11 | npys.append(phone) 12 | big_npy=np.concatenate(npys,0) 13 | print(big_npy.shape)#(6196072, 192)#fp32#4.43G 14 | np.save("infer/big_src_feature_mi.npy",big_npy) 15 | 16 | ##################train+add 17 | # big_npy=np.load("/bili-coeus/jupyter/jupyterhub-liujing04/vits_ch/inference_f0/big_src_feature_mi.npy") 18 | print(big_npy.shape) 19 | index = faiss.index_factory(256, "IVF512,Flat")#mi 20 | print("training") 21 | index_ivf = faiss.extract_index_ivf(index)# 22 | index_ivf.nprobe = 9 23 | index.train(big_npy) 24 | faiss.write_index(index, 'infer/trained_IVF512_Flat_mi_baseline_src_feat.index') 25 | print("adding") 26 | index.add(big_npy) 27 | faiss.write_index(index,"infer/added_IVF512_Flat_mi_baseline_src_feat.index") 28 | ''' 29 | 大小(都是FP32) 30 | big_src_feature 2.95G 31 | (3098036, 256) 32 | big_emb 4.43G 33 | (6196072, 192) 34 | big_emb双倍是因为求特征要repeat后再加pitch 35 | 36 | ''' -------------------------------------------------------------------------------- /infer/trans_weights.py: -------------------------------------------------------------------------------- 1 | import torch,pdb 2 | 3 | # a=torch.load(r"E:\codes\py39\vits_vc_gpu_train\logs\ft-mi-suc\G_1000.pth")["model"]#sim_nsf# 4 | # a=torch.load(r"E:\codes\py39\vits_vc_gpu_train\logs\ft-mi-freeze-vocoder-flow-enc_q\G_1000.pth")["model"]#sim_nsf# 5 | # a=torch.load(r"E:\codes\py39\vits_vc_gpu_train\logs\ft-mi-freeze-vocoder\G_1000.pth")["model"]#sim_nsf# 6 | # a=torch.load(r"E:\codes\py39\vits_vc_gpu_train\logs\ft-mi-test\G_1000.pth")["model"]#sim_nsf# 7 | a=torch.load(r"E:\codes\py39\vits_vc_gpu_train\logs\ft-mi-no_opt-no_dropout\G_1000.pth")["model"]#sim_nsf# 8 | for key in a.keys():a[key]=a[key].half() 9 | # torch.save(a,"ft-mi-freeze-vocoder_true_1k.pt")# 10 | # torch.save(a,"ft-mi-sim1k.pt")# 11 | torch.save(a,"ft-mi-no_opt-no_dropout.pt")# 12 | -------------------------------------------------------------------------------- /infer_pack/commons.py: -------------------------------------------------------------------------------- 1 | import math 2 | import numpy as np 3 | import torch 4 | from torch import nn 5 | from torch.nn import functional as F 6 | 7 | 8 | def init_weights(m, mean=0.0, std=0.01): 9 | classname = m.__class__.__name__ 10 | if classname.find("Conv") != -1: 11 | m.weight.data.normal_(mean, std) 12 | 13 | 14 | def get_padding(kernel_size, dilation=1): 15 | return int((kernel_size * dilation - dilation) / 2) 16 | 17 | 18 | def convert_pad_shape(pad_shape): 19 | l = pad_shape[::-1] 20 | pad_shape = [item for sublist in l for item in sublist] 21 | return pad_shape 22 | 23 | 24 | def kl_divergence(m_p, logs_p, m_q, logs_q): 25 | """KL(P||Q)""" 26 | kl = (logs_q - logs_p) - 0.5 27 | kl += ( 28 | 0.5 * (torch.exp(2.0 * logs_p) + ((m_p - m_q) ** 2)) * torch.exp(-2.0 * logs_q) 29 | ) 30 | return kl 31 | 32 | 33 | def rand_gumbel(shape): 34 | """Sample from the Gumbel distribution, protect from overflows.""" 35 | uniform_samples = torch.rand(shape) * 0.99998 + 0.00001 36 | return -torch.log(-torch.log(uniform_samples)) 37 | 38 | 39 | def rand_gumbel_like(x): 40 | g = rand_gumbel(x.size()).to(dtype=x.dtype, device=x.device) 41 | return g 42 | 43 | 44 | def slice_segments(x, ids_str, segment_size=4): 45 | ret = torch.zeros_like(x[:, :, :segment_size]) 
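    # For each element in the batch, copy a window of length segment_size starting at
    # ids_str[i] along the time axis; rand_slice_segments below calls this with random
    # start indices to draw training segments.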
46 | for i in range(x.size(0)): 47 | idx_str = ids_str[i] 48 | idx_end = idx_str + segment_size 49 | ret[i] = x[i, :, idx_str:idx_end] 50 | return ret 51 | def slice_segments2(x, ids_str, segment_size=4): 52 | ret = torch.zeros_like(x[:, :segment_size]) 53 | for i in range(x.size(0)): 54 | idx_str = ids_str[i] 55 | idx_end = idx_str + segment_size 56 | ret[i] = x[i, idx_str:idx_end] 57 | return ret 58 | 59 | 60 | def rand_slice_segments(x, x_lengths=None, segment_size=4): 61 | b, d, t = x.size() 62 | if x_lengths is None: 63 | x_lengths = t 64 | ids_str_max = x_lengths - segment_size + 1 65 | ids_str = (torch.rand([b]).to(device=x.device) * ids_str_max).to(dtype=torch.long) 66 | ret = slice_segments(x, ids_str, segment_size) 67 | return ret, ids_str 68 | 69 | 70 | def get_timing_signal_1d(length, channels, min_timescale=1.0, max_timescale=1.0e4): 71 | position = torch.arange(length, dtype=torch.float) 72 | num_timescales = channels // 2 73 | log_timescale_increment = math.log(float(max_timescale) / float(min_timescale)) / ( 74 | num_timescales - 1 75 | ) 76 | inv_timescales = min_timescale * torch.exp( 77 | torch.arange(num_timescales, dtype=torch.float) * -log_timescale_increment 78 | ) 79 | scaled_time = position.unsqueeze(0) * inv_timescales.unsqueeze(1) 80 | signal = torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], 0) 81 | signal = F.pad(signal, [0, 0, 0, channels % 2]) 82 | signal = signal.view(1, channels, length) 83 | return signal 84 | 85 | 86 | def add_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4): 87 | b, channels, length = x.size() 88 | signal = get_timing_signal_1d(length, channels, min_timescale, max_timescale) 89 | return x + signal.to(dtype=x.dtype, device=x.device) 90 | 91 | 92 | def cat_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4, axis=1): 93 | b, channels, length = x.size() 94 | signal = get_timing_signal_1d(length, channels, min_timescale, max_timescale) 95 | return torch.cat([x, signal.to(dtype=x.dtype, device=x.device)], axis) 96 | 97 | 98 | def subsequent_mask(length): 99 | mask = torch.tril(torch.ones(length, length)).unsqueeze(0).unsqueeze(0) 100 | return mask 101 | 102 | 103 | @torch.jit.script 104 | def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels): 105 | n_channels_int = n_channels[0] 106 | in_act = input_a + input_b 107 | t_act = torch.tanh(in_act[:, :n_channels_int, :]) 108 | s_act = torch.sigmoid(in_act[:, n_channels_int:, :]) 109 | acts = t_act * s_act 110 | return acts 111 | 112 | 113 | def convert_pad_shape(pad_shape): 114 | l = pad_shape[::-1] 115 | pad_shape = [item for sublist in l for item in sublist] 116 | return pad_shape 117 | 118 | 119 | def shift_1d(x): 120 | x = F.pad(x, convert_pad_shape([[0, 0], [0, 0], [1, 0]]))[:, :, :-1] 121 | return x 122 | 123 | 124 | def sequence_mask(length, max_length=None): 125 | if max_length is None: 126 | max_length = length.max() 127 | x = torch.arange(max_length, dtype=length.dtype, device=length.device) 128 | return x.unsqueeze(0) < length.unsqueeze(1) 129 | 130 | 131 | def generate_path(duration, mask): 132 | """ 133 | duration: [b, 1, t_x] 134 | mask: [b, 1, t_y, t_x] 135 | """ 136 | device = duration.device 137 | 138 | b, _, t_y, t_x = mask.shape 139 | cum_duration = torch.cumsum(duration, -1) 140 | 141 | cum_duration_flat = cum_duration.view(b * t_x) 142 | path = sequence_mask(cum_duration_flat, t_y).to(mask.dtype) 143 | path = path.view(b, t_x, t_y) 144 | path = path - F.pad(path, convert_pad_shape([[0, 0], [1, 0], [0, 0]]))[:, :-1] 145 | path = 
path.unsqueeze(1).transpose(2, 3) * mask 146 | return path 147 | 148 | 149 | def clip_grad_value_(parameters, clip_value, norm_type=2): 150 | if isinstance(parameters, torch.Tensor): 151 | parameters = [parameters] 152 | parameters = list(filter(lambda p: p.grad is not None, parameters)) 153 | norm_type = float(norm_type) 154 | if clip_value is not None: 155 | clip_value = float(clip_value) 156 | 157 | total_norm = 0 158 | for p in parameters: 159 | param_norm = p.grad.data.norm(norm_type) 160 | total_norm += param_norm.item() ** norm_type 161 | if clip_value is not None: 162 | p.grad.data.clamp_(min=-clip_value, max=clip_value) 163 | total_norm = total_norm ** (1.0 / norm_type) 164 | return total_norm 165 | -------------------------------------------------------------------------------- /infer_pack/transforms.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.nn import functional as F 3 | 4 | import numpy as np 5 | 6 | 7 | DEFAULT_MIN_BIN_WIDTH = 1e-3 8 | DEFAULT_MIN_BIN_HEIGHT = 1e-3 9 | DEFAULT_MIN_DERIVATIVE = 1e-3 10 | 11 | 12 | def piecewise_rational_quadratic_transform(inputs, 13 | unnormalized_widths, 14 | unnormalized_heights, 15 | unnormalized_derivatives, 16 | inverse=False, 17 | tails=None, 18 | tail_bound=1., 19 | min_bin_width=DEFAULT_MIN_BIN_WIDTH, 20 | min_bin_height=DEFAULT_MIN_BIN_HEIGHT, 21 | min_derivative=DEFAULT_MIN_DERIVATIVE): 22 | 23 | if tails is None: 24 | spline_fn = rational_quadratic_spline 25 | spline_kwargs = {} 26 | else: 27 | spline_fn = unconstrained_rational_quadratic_spline 28 | spline_kwargs = { 29 | 'tails': tails, 30 | 'tail_bound': tail_bound 31 | } 32 | 33 | outputs, logabsdet = spline_fn( 34 | inputs=inputs, 35 | unnormalized_widths=unnormalized_widths, 36 | unnormalized_heights=unnormalized_heights, 37 | unnormalized_derivatives=unnormalized_derivatives, 38 | inverse=inverse, 39 | min_bin_width=min_bin_width, 40 | min_bin_height=min_bin_height, 41 | min_derivative=min_derivative, 42 | **spline_kwargs 43 | ) 44 | return outputs, logabsdet 45 | 46 | 47 | def searchsorted(bin_locations, inputs, eps=1e-6): 48 | bin_locations[..., -1] += eps 49 | return torch.sum( 50 | inputs[..., None] >= bin_locations, 51 | dim=-1 52 | ) - 1 53 | 54 | 55 | def unconstrained_rational_quadratic_spline(inputs, 56 | unnormalized_widths, 57 | unnormalized_heights, 58 | unnormalized_derivatives, 59 | inverse=False, 60 | tails='linear', 61 | tail_bound=1., 62 | min_bin_width=DEFAULT_MIN_BIN_WIDTH, 63 | min_bin_height=DEFAULT_MIN_BIN_HEIGHT, 64 | min_derivative=DEFAULT_MIN_DERIVATIVE): 65 | inside_interval_mask = (inputs >= -tail_bound) & (inputs <= tail_bound) 66 | outside_interval_mask = ~inside_interval_mask 67 | 68 | outputs = torch.zeros_like(inputs) 69 | logabsdet = torch.zeros_like(inputs) 70 | 71 | if tails == 'linear': 72 | unnormalized_derivatives = F.pad(unnormalized_derivatives, pad=(1, 1)) 73 | constant = np.log(np.exp(1 - min_derivative) - 1) 74 | unnormalized_derivatives[..., 0] = constant 75 | unnormalized_derivatives[..., -1] = constant 76 | 77 | outputs[outside_interval_mask] = inputs[outside_interval_mask] 78 | logabsdet[outside_interval_mask] = 0 79 | else: 80 | raise RuntimeError('{} tails are not implemented.'.format(tails)) 81 | 82 | outputs[inside_interval_mask], logabsdet[inside_interval_mask] = rational_quadratic_spline( 83 | inputs=inputs[inside_interval_mask], 84 | unnormalized_widths=unnormalized_widths[inside_interval_mask, :], 85 | 
unnormalized_heights=unnormalized_heights[inside_interval_mask, :], 86 | unnormalized_derivatives=unnormalized_derivatives[inside_interval_mask, :], 87 | inverse=inverse, 88 | left=-tail_bound, right=tail_bound, bottom=-tail_bound, top=tail_bound, 89 | min_bin_width=min_bin_width, 90 | min_bin_height=min_bin_height, 91 | min_derivative=min_derivative 92 | ) 93 | 94 | return outputs, logabsdet 95 | 96 | def rational_quadratic_spline(inputs, 97 | unnormalized_widths, 98 | unnormalized_heights, 99 | unnormalized_derivatives, 100 | inverse=False, 101 | left=0., right=1., bottom=0., top=1., 102 | min_bin_width=DEFAULT_MIN_BIN_WIDTH, 103 | min_bin_height=DEFAULT_MIN_BIN_HEIGHT, 104 | min_derivative=DEFAULT_MIN_DERIVATIVE): 105 | if torch.min(inputs) < left or torch.max(inputs) > right: 106 | raise ValueError('Input to a transform is not within its domain') 107 | 108 | num_bins = unnormalized_widths.shape[-1] 109 | 110 | if min_bin_width * num_bins > 1.0: 111 | raise ValueError('Minimal bin width too large for the number of bins') 112 | if min_bin_height * num_bins > 1.0: 113 | raise ValueError('Minimal bin height too large for the number of bins') 114 | 115 | widths = F.softmax(unnormalized_widths, dim=-1) 116 | widths = min_bin_width + (1 - min_bin_width * num_bins) * widths 117 | cumwidths = torch.cumsum(widths, dim=-1) 118 | cumwidths = F.pad(cumwidths, pad=(1, 0), mode='constant', value=0.0) 119 | cumwidths = (right - left) * cumwidths + left 120 | cumwidths[..., 0] = left 121 | cumwidths[..., -1] = right 122 | widths = cumwidths[..., 1:] - cumwidths[..., :-1] 123 | 124 | derivatives = min_derivative + F.softplus(unnormalized_derivatives) 125 | 126 | heights = F.softmax(unnormalized_heights, dim=-1) 127 | heights = min_bin_height + (1 - min_bin_height * num_bins) * heights 128 | cumheights = torch.cumsum(heights, dim=-1) 129 | cumheights = F.pad(cumheights, pad=(1, 0), mode='constant', value=0.0) 130 | cumheights = (top - bottom) * cumheights + bottom 131 | cumheights[..., 0] = bottom 132 | cumheights[..., -1] = top 133 | heights = cumheights[..., 1:] - cumheights[..., :-1] 134 | 135 | if inverse: 136 | bin_idx = searchsorted(cumheights, inputs)[..., None] 137 | else: 138 | bin_idx = searchsorted(cumwidths, inputs)[..., None] 139 | 140 | input_cumwidths = cumwidths.gather(-1, bin_idx)[..., 0] 141 | input_bin_widths = widths.gather(-1, bin_idx)[..., 0] 142 | 143 | input_cumheights = cumheights.gather(-1, bin_idx)[..., 0] 144 | delta = heights / widths 145 | input_delta = delta.gather(-1, bin_idx)[..., 0] 146 | 147 | input_derivatives = derivatives.gather(-1, bin_idx)[..., 0] 148 | input_derivatives_plus_one = derivatives[..., 1:].gather(-1, bin_idx)[..., 0] 149 | 150 | input_heights = heights.gather(-1, bin_idx)[..., 0] 151 | 152 | if inverse: 153 | a = (((inputs - input_cumheights) * (input_derivatives 154 | + input_derivatives_plus_one 155 | - 2 * input_delta) 156 | + input_heights * (input_delta - input_derivatives))) 157 | b = (input_heights * input_derivatives 158 | - (inputs - input_cumheights) * (input_derivatives 159 | + input_derivatives_plus_one 160 | - 2 * input_delta)) 161 | c = - input_delta * (inputs - input_cumheights) 162 | 163 | discriminant = b.pow(2) - 4 * a * c 164 | assert (discriminant >= 0).all() 165 | 166 | root = (2 * c) / (-b - torch.sqrt(discriminant)) 167 | outputs = root * input_bin_widths + input_cumwidths 168 | 169 | theta_one_minus_theta = root * (1 - root) 170 | denominator = input_delta + ((input_derivatives + input_derivatives_plus_one - 2 * input_delta) 171 
| * theta_one_minus_theta) 172 | derivative_numerator = input_delta.pow(2) * (input_derivatives_plus_one * root.pow(2) 173 | + 2 * input_delta * theta_one_minus_theta 174 | + input_derivatives * (1 - root).pow(2)) 175 | logabsdet = torch.log(derivative_numerator) - 2 * torch.log(denominator) 176 | 177 | return outputs, -logabsdet 178 | else: 179 | theta = (inputs - input_cumwidths) / input_bin_widths 180 | theta_one_minus_theta = theta * (1 - theta) 181 | 182 | numerator = input_heights * (input_delta * theta.pow(2) 183 | + input_derivatives * theta_one_minus_theta) 184 | denominator = input_delta + ((input_derivatives + input_derivatives_plus_one - 2 * input_delta) 185 | * theta_one_minus_theta) 186 | outputs = input_cumheights + numerator / denominator 187 | 188 | derivative_numerator = input_delta.pow(2) * (input_derivatives_plus_one * theta.pow(2) 189 | + 2 * input_delta * theta_one_minus_theta 190 | + input_derivatives * (1 - theta).pow(2)) 191 | logabsdet = torch.log(derivative_numerator) - 2 * torch.log(denominator) 192 | 193 | return outputs, logabsdet 194 | -------------------------------------------------------------------------------- /infer_uvr5.py: -------------------------------------------------------------------------------- 1 | import os,sys,torch,warnings,pdb 2 | warnings.filterwarnings("ignore") 3 | import librosa 4 | import importlib 5 | import numpy as np 6 | import hashlib , math 7 | from tqdm import tqdm 8 | from uvr5_pack.lib_v5 import spec_utils 9 | from uvr5_pack.utils import _get_name_params,inference 10 | from uvr5_pack.lib_v5.model_param_init import ModelParameters 11 | from scipy.io import wavfile 12 | 13 | class _audio_pre_(): 14 | def __init__(self, model_path,device,is_half): 15 | self.model_path = model_path 16 | self.device = device 17 | self.data = { 18 | # Processing Options 19 | 'postprocess': False, 20 | 'tta': False, 21 | # Constants 22 | 'window_size': 512, 23 | 'agg': 10, 24 | 'high_end_process': 'mirroring', 25 | } 26 | nn_arch_sizes = [ 27 | 31191, # default 28 | 33966,61968, 123821, 123812, 537238 # custom 29 | ] 30 | self.nn_architecture = list('{}KB'.format(s) for s in nn_arch_sizes) 31 | model_size = math.ceil(os.stat(model_path ).st_size / 1024) 32 | nn_architecture = '{}KB'.format(min(nn_arch_sizes, key=lambda x:abs(x-model_size))) 33 | nets = importlib.import_module('uvr5_pack.lib_v5.nets' + f'_{nn_architecture}'.replace('_{}KB'.format(nn_arch_sizes[0]), ''), package=None) 34 | model_hash = hashlib.md5(open(model_path,'rb').read()).hexdigest() 35 | param_name ,model_params_d = _get_name_params(model_path , model_hash) 36 | 37 | mp = ModelParameters(model_params_d) 38 | model = nets.CascadedASPPNet(mp.param['bins'] * 2) 39 | cpk = torch.load( model_path , map_location='cpu') 40 | model.load_state_dict(cpk) 41 | model.eval() 42 | if(is_half==True):model = model.half().to(device) 43 | else:model = model.to(device) 44 | 45 | self.mp = mp 46 | self.model = model 47 | 48 | def _path_audio_(self, music_file ,ins_root=None,vocal_root=None): 49 | if(ins_root is None and vocal_root is None):return "No save root." 
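# Editor's note: the snippet below is a minimal, hypothetical round-trip check for
# rational_quadratic_spline from infer_pack/transforms.py above; it is not part of the repo.
# Tensor sizes are illustrative, and inputs must lie in the [left, right] = [0, 1] domain
# that the function itself enforces.
import torch
from infer_pack.transforms import rational_quadratic_spline

torch.manual_seed(0)
n, num_bins = 4, 10
x = torch.rand(n)                    # points inside [0, 1]
w = torch.randn(n, num_bins)         # unnormalized bin widths
h = torch.randn(n, num_bins)         # unnormalized bin heights
d = torch.randn(n, num_bins + 1)     # one unnormalized derivative per bin edge
y, logdet = rational_quadratic_spline(x, w, h, d, inverse=False)
x_rec, logdet_inv = rational_quadratic_spline(y, w, h, d, inverse=True)
assert torch.allclose(x, x_rec, atol=1e-4)            # the spline inverts its own output
assert torch.allclose(logdet, -logdet_inv, atol=1e-4) # log-determinants negate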
50 | name=os.path.basename(music_file) 51 | if(ins_root is not None):os.makedirs(ins_root, exist_ok=True) 52 | if(vocal_root is not None):os.makedirs(vocal_root , exist_ok=True) 53 | X_wave, y_wave, X_spec_s, y_spec_s = {}, {}, {}, {} 54 | bands_n = len(self.mp.param['band']) 55 | # print(bands_n) 56 | for d in range(bands_n, 0, -1): 57 | bp = self.mp.param['band'][d] 58 | if d == bands_n: # high-end band 59 | X_wave[d], _ = librosa.core.load(#理论上librosa读取可能对某些音频有bug,应该上ffmpeg读取,但是太麻烦了弃坑 60 | music_file, bp['sr'], False, dtype=np.float32, res_type=bp['res_type']) 61 | if X_wave[d].ndim == 1: 62 | X_wave[d] = np.asfortranarray([X_wave[d], X_wave[d]]) 63 | else: # lower bands 64 | X_wave[d] = librosa.core.resample(X_wave[d+1], self.mp.param['band'][d+1]['sr'], bp['sr'], res_type=bp['res_type']) 65 | # Stft of wave source 66 | X_spec_s[d] = spec_utils.wave_to_spectrogram_mt(X_wave[d], bp['hl'], bp['n_fft'], self.mp.param['mid_side'], self.mp.param['mid_side_b2'], self.mp.param['reverse']) 67 | # pdb.set_trace() 68 | if d == bands_n and self.data['high_end_process'] != 'none': 69 | input_high_end_h = (bp['n_fft']//2 - bp['crop_stop']) + ( self.mp.param['pre_filter_stop'] - self.mp.param['pre_filter_start']) 70 | input_high_end = X_spec_s[d][:, bp['n_fft']//2-input_high_end_h:bp['n_fft']//2, :] 71 | 72 | X_spec_m = spec_utils.combine_spectrograms(X_spec_s, self.mp) 73 | aggresive_set = float(self.data['agg']/100) 74 | aggressiveness = {'value': aggresive_set, 'split_bin': self.mp.param['band'][1]['crop_stop']} 75 | with torch.no_grad(): 76 | pred, X_mag, X_phase = inference(X_spec_m,self.device,self.model, aggressiveness,self.data) 77 | # Postprocess 78 | if self.data['postprocess']: 79 | pred_inv = np.clip(X_mag - pred, 0, np.inf) 80 | pred = spec_utils.mask_silence(pred, pred_inv) 81 | y_spec_m = pred * X_phase 82 | v_spec_m = X_spec_m - y_spec_m 83 | 84 | if (ins_root is not None): 85 | if self.data['high_end_process'].startswith('mirroring'): 86 | input_high_end_ = spec_utils.mirroring(self.data['high_end_process'], y_spec_m, input_high_end, self.mp) 87 | wav_instrument = spec_utils.cmb_spectrogram_to_wave(y_spec_m, self.mp,input_high_end_h, input_high_end_) 88 | else: 89 | wav_instrument = spec_utils.cmb_spectrogram_to_wave(y_spec_m, self.mp) 90 | print ('%s instruments done'%name) 91 | wavfile.write(os.path.join(ins_root, 'instrument_{}.wav'.format(name) ), self.mp.param['sr'], (np.array(wav_instrument)*32768).astype("int16")) # 92 | if (vocal_root is not None): 93 | if self.data['high_end_process'].startswith('mirroring'): 94 | input_high_end_ = spec_utils.mirroring(self.data['high_end_process'], v_spec_m, input_high_end, self.mp) 95 | wav_vocals = spec_utils.cmb_spectrogram_to_wave(v_spec_m, self.mp, input_high_end_h, input_high_end_) 96 | else: 97 | wav_vocals = spec_utils.cmb_spectrogram_to_wave(v_spec_m, self.mp) 98 | print ('%s vocals done'%name) 99 | wavfile.write(os.path.join(vocal_root , 'vocal_{}.wav'.format(name) ), self.mp.param['sr'], (np.array(wav_vocals)*32768).astype("int16")) 100 | 101 | if __name__ == '__main__': 102 | device = 'cuda' 103 | is_half=True 104 | model_path='uvr5_weights/2_HP-UVR.pth' 105 | pre_fun = _audio_pre_(model_path=model_path,device=device,is_half=True) 106 | audio_path = '神女劈观.aac' 107 | save_path = 'opt' 108 | pre_fun._path_audio_(audio_path , save_path,save_path) 109 | -------------------------------------------------------------------------------- /logs/mute/0_gt_wavs/mute32k.wav: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/yantaisa11/Retrieval-based-Voice-Conversion-WebUI-JP-localization/106f3ac45e1a1206489b32500f1fd66ad1744339/logs/mute/0_gt_wavs/mute32k.wav -------------------------------------------------------------------------------- /logs/mute/0_gt_wavs/mute40k.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yantaisa11/Retrieval-based-Voice-Conversion-WebUI-JP-localization/106f3ac45e1a1206489b32500f1fd66ad1744339/logs/mute/0_gt_wavs/mute40k.wav -------------------------------------------------------------------------------- /logs/mute/0_gt_wavs/mute48k.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yantaisa11/Retrieval-based-Voice-Conversion-WebUI-JP-localization/106f3ac45e1a1206489b32500f1fd66ad1744339/logs/mute/0_gt_wavs/mute48k.wav -------------------------------------------------------------------------------- /logs/mute/1_16k_wavs/mute.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yantaisa11/Retrieval-based-Voice-Conversion-WebUI-JP-localization/106f3ac45e1a1206489b32500f1fd66ad1744339/logs/mute/1_16k_wavs/mute.wav -------------------------------------------------------------------------------- /logs/mute/2a_f0/mute.wav.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yantaisa11/Retrieval-based-Voice-Conversion-WebUI-JP-localization/106f3ac45e1a1206489b32500f1fd66ad1744339/logs/mute/2a_f0/mute.wav.npy -------------------------------------------------------------------------------- /logs/mute/2b-f0nsf/mute.wav.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yantaisa11/Retrieval-based-Voice-Conversion-WebUI-JP-localization/106f3ac45e1a1206489b32500f1fd66ad1744339/logs/mute/2b-f0nsf/mute.wav.npy -------------------------------------------------------------------------------- /logs/mute/3_feature256/mute.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yantaisa11/Retrieval-based-Voice-Conversion-WebUI-JP-localization/106f3ac45e1a1206489b32500f1fd66ad1744339/logs/mute/3_feature256/mute.npy -------------------------------------------------------------------------------- /my_utils.py: -------------------------------------------------------------------------------- 1 | import ffmpeg 2 | import numpy as np 3 | def load_audio(file,sr): 4 | try: 5 | # https://github.com/openai/whisper/blob/main/whisper/audio.py#L26 6 | # This launches a subprocess to decode audio while down-mixing and resampling as necessary. 7 | # Requires the ffmpeg CLI and `ffmpeg-python` package to be installed. 
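# Editor's note: the ffmpeg-python chain below is roughly equivalent to running
#   ffmpeg -nostdin -threads 0 -i <file> -f s16le -acodec pcm_s16le -ac 1 -ar <sr> -
# and reading the raw 16-bit PCM from stdout (<file> and <sr> are placeholders).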
8 | file=file.strip(" ").strip('"').strip("\n").strip('"').strip(" ")#防止小白拷路径头尾带了空格和"和回车 9 | out, _ = ( 10 | ffmpeg.input(file, threads=0) 11 | .output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sr) 12 | .run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True) 13 | ) 14 | except Exception as e: 15 | raise RuntimeError(f"Failed to load audio: {e}") 16 | 17 | return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0 18 | -------------------------------------------------------------------------------- /pretrained/.gitignore: -------------------------------------------------------------------------------- 1 | * 2 | !.gitignore 3 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.poetry] 2 | name = "rvc-beta" 3 | version = "0.1.0" 4 | description = "" 5 | authors = ["lj1995"] 6 | license = "MIT" 7 | 8 | [tool.poetry.dependencies] 9 | python = "^3.8" 10 | torch = "^2.0.0" 11 | torchaudio = "^2.0.1" 12 | Cython = "^0.29.34" 13 | gradio = "^3.24.1" 14 | future = "^0.18.3" 15 | pydub = "^0.25.1" 16 | soundfile = "^0.12.1" 17 | ffmpeg-python = "^0.2.0" 18 | tensorboardX = "^2.6" 19 | functorch = "^2.0.0" 20 | fairseq = "^0.12.2" 21 | faiss-gpu = "^1.7.2" 22 | faiss-cpu = "^1.7.3" 23 | Jinja2 = "^3.1.2" 24 | json5 = "^0.9.11" 25 | librosa = "0.9.2" 26 | llvmlite = "0.39.0" 27 | Markdown = "^3.4.3" 28 | matplotlib = "^3.7.1" 29 | matplotlib-inline = "^0.1.6" 30 | numba = "0.56.4" 31 | numpy = "1.23.5" 32 | scipy = "1.9.3" 33 | praat-parselmouth = "^0.4.3" 34 | Pillow = "9.1.1" 35 | pyworld = "^0.3.2" 36 | resampy = "^0.4.2" 37 | scikit-learn = "^1.2.2" 38 | starlette = "^0.26.1" 39 | tensorboard = "^2.12.1" 40 | tensorboard-data-server = "^0.7.0" 41 | tensorboard-plugin-wit = "^1.8.1" 42 | torchgen = "^0.0.1" 43 | tqdm = "^4.65.0" 44 | tornado = "^6.2" 45 | Werkzeug = "^2.2.3" 46 | uc-micro-py = "^1.0.1" 47 | sympy = "^1.11.1" 48 | tabulate = "^0.9.0" 49 | PyYAML = "^6.0" 50 | pyasn1 = "^0.4.8" 51 | pyasn1-modules = "^0.2.8" 52 | fsspec = "^2023.3.0" 53 | absl-py = "^1.4.0" 54 | audioread = "^3.0.0" 55 | uvicorn = "^0.21.1" 56 | colorama = "^0.4.6" 57 | 58 | [tool.poetry.dev-dependencies] 59 | 60 | [build-system] 61 | requires = ["poetry-core>=1.0.0"] 62 | build-backend = "poetry.core.masonry.api" 63 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numba==0.56.4 2 | numpy==1.23.5 3 | scipy==1.9.3 4 | librosa==0.9.2 5 | llvmlite==0.39.0 6 | fairseq==0.12.2 7 | faiss-cpu==1.7.2 8 | gradio 9 | Cython 10 | future>=0.18.3 11 | pydub>=0.25.1 12 | soundfile>=0.12.1 13 | ffmpeg-python>=0.2.0 14 | tensorboardX 15 | functorch>=2.0.0 16 | Jinja2>=3.1.2 17 | json5>=0.9.11 18 | Markdown 19 | matplotlib>=3.7.1 20 | matplotlib-inline>=0.1.6 21 | praat-parselmouth>=0.4.3 22 | Pillow>=9.1.1 23 | pyworld>=0.3.2 24 | resampy>=0.4.2 25 | scikit-learn>=1.2.2 26 | starlette>=0.26.1 27 | tensorboard 28 | tensorboard-data-server 29 | tensorboard-plugin-wit 30 | torchgen>=0.0.1 31 | tqdm>=4.65.0 32 | tornado>=6.2 33 | Werkzeug>=2.2.3 34 | uc-micro-py>=1.0.1 35 | sympy>=1.11.1 36 | tabulate>=0.9.0 37 | PyYAML>=6.0 38 | pyasn1>=0.4.8 39 | pyasn1-modules>=0.2.8 40 | fsspec>=2023.3.0 41 | absl-py>=1.4.0 42 | audioread 43 | uvicorn>=0.21.1 44 | colorama>=0.4.6 45 | 
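# Editor's note: a minimal usage sketch for my_utils.load_audio above; it is not part of the
# repo. It assumes the dependencies listed above are installed (pip install -r requirements.txt)
# and that the ffmpeg binary is on PATH; the file path is a placeholder.
from my_utils import load_audio

audio = load_audio("path/to/some_clip.wav", 16000)  # any ffmpeg-readable format works
print(audio.shape, audio.dtype)                     # (n_samples,), float32 roughly in [-1, 1]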
-------------------------------------------------------------------------------- /slicer2.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | 4 | # This function is obtained from librosa. 5 | def get_rms( 6 | y, 7 | frame_length=2048, 8 | hop_length=512, 9 | pad_mode="constant", 10 | ): 11 | padding = (int(frame_length // 2), int(frame_length // 2)) 12 | y = np.pad(y, padding, mode=pad_mode) 13 | 14 | axis = -1 15 | # put our new within-frame axis at the end for now 16 | out_strides = y.strides + tuple([y.strides[axis]]) 17 | # Reduce the shape on the framing axis 18 | x_shape_trimmed = list(y.shape) 19 | x_shape_trimmed[axis] -= frame_length - 1 20 | out_shape = tuple(x_shape_trimmed) + tuple([frame_length]) 21 | xw = np.lib.stride_tricks.as_strided( 22 | y, shape=out_shape, strides=out_strides 23 | ) 24 | if axis < 0: 25 | target_axis = axis - 1 26 | else: 27 | target_axis = axis + 1 28 | xw = np.moveaxis(xw, -1, target_axis) 29 | # Downsample along the target axis 30 | slices = [slice(None)] * xw.ndim 31 | slices[axis] = slice(0, None, hop_length) 32 | x = xw[tuple(slices)] 33 | 34 | # Calculate power 35 | power = np.mean(np.abs(x) ** 2, axis=-2, keepdims=True) 36 | 37 | return np.sqrt(power) 38 | 39 | 40 | class Slicer: 41 | def __init__(self, 42 | sr: int, 43 | threshold: float = -40., 44 | min_length: int = 5000, 45 | min_interval: int = 300, 46 | hop_size: int = 20, 47 | max_sil_kept: int = 5000): 48 | if not min_length >= min_interval >= hop_size: 49 | raise ValueError('The following condition must be satisfied: min_length >= min_interval >= hop_size') 50 | if not max_sil_kept >= hop_size: 51 | raise ValueError('The following condition must be satisfied: max_sil_kept >= hop_size') 52 | min_interval = sr * min_interval / 1000 53 | self.threshold = 10 ** (threshold / 20.) 54 | self.hop_size = round(sr * hop_size / 1000) 55 | self.win_size = min(round(min_interval), 4 * self.hop_size) 56 | self.min_length = round(sr * min_length / 1000 / self.hop_size) 57 | self.min_interval = round(min_interval / self.hop_size) 58 | self.max_sil_kept = round(sr * max_sil_kept / 1000 / self.hop_size) 59 | 60 | def _apply_slice(self, waveform, begin, end): 61 | if len(waveform.shape) > 1: 62 | return waveform[:, begin * self.hop_size: min(waveform.shape[1], end * self.hop_size)] 63 | else: 64 | return waveform[begin * self.hop_size: min(waveform.shape[0], end * self.hop_size)] 65 | 66 | # @timeit 67 | def slice(self, waveform): 68 | if len(waveform.shape) > 1: 69 | samples = waveform.mean(axis=0) 70 | else: 71 | samples = waveform 72 | if samples.shape[0] <= self.min_length: 73 | return [waveform] 74 | rms_list = get_rms(y=samples, frame_length=self.win_size, hop_length=self.hop_size).squeeze(0) 75 | sil_tags = [] 76 | silence_start = None 77 | clip_start = 0 78 | for i, rms in enumerate(rms_list): 79 | # Keep looping while frame is silent. 80 | if rms < self.threshold: 81 | # Record start of silent frames. 82 | if silence_start is None: 83 | silence_start = i 84 | continue 85 | # Keep looping while frame is not silent and silence start has not been recorded. 
86 | if silence_start is None: 87 | continue 88 | # Clear recorded silence start if interval is not enough or clip is too short 89 | is_leading_silence = silence_start == 0 and i > self.max_sil_kept 90 | need_slice_middle = i - silence_start >= self.min_interval and i - clip_start >= self.min_length 91 | if not is_leading_silence and not need_slice_middle: 92 | silence_start = None 93 | continue 94 | # Need slicing. Record the range of silent frames to be removed. 95 | if i - silence_start <= self.max_sil_kept: 96 | pos = rms_list[silence_start: i + 1].argmin() + silence_start 97 | if silence_start == 0: 98 | sil_tags.append((0, pos)) 99 | else: 100 | sil_tags.append((pos, pos)) 101 | clip_start = pos 102 | elif i - silence_start <= self.max_sil_kept * 2: 103 | pos = rms_list[i - self.max_sil_kept: silence_start + self.max_sil_kept + 1].argmin() 104 | pos += i - self.max_sil_kept 105 | pos_l = rms_list[silence_start: silence_start + self.max_sil_kept + 1].argmin() + silence_start 106 | pos_r = rms_list[i - self.max_sil_kept: i + 1].argmin() + i - self.max_sil_kept 107 | if silence_start == 0: 108 | sil_tags.append((0, pos_r)) 109 | clip_start = pos_r 110 | else: 111 | sil_tags.append((min(pos_l, pos), max(pos_r, pos))) 112 | clip_start = max(pos_r, pos) 113 | else: 114 | pos_l = rms_list[silence_start: silence_start + self.max_sil_kept + 1].argmin() + silence_start 115 | pos_r = rms_list[i - self.max_sil_kept: i + 1].argmin() + i - self.max_sil_kept 116 | if silence_start == 0: 117 | sil_tags.append((0, pos_r)) 118 | else: 119 | sil_tags.append((pos_l, pos_r)) 120 | clip_start = pos_r 121 | silence_start = None 122 | # Deal with trailing silence. 123 | total_frames = rms_list.shape[0] 124 | if silence_start is not None and total_frames - silence_start >= self.min_interval: 125 | silence_end = min(total_frames, silence_start + self.max_sil_kept) 126 | pos = rms_list[silence_start: silence_end + 1].argmin() + silence_start 127 | sil_tags.append((pos, total_frames + 1)) 128 | # Apply and return slices. 
129 | if len(sil_tags) == 0: 130 | return [waveform] 131 | else: 132 | chunks = [] 133 | if sil_tags[0][0] > 0: 134 | chunks.append(self._apply_slice(waveform, 0, sil_tags[0][0])) 135 | for i in range(len(sil_tags) - 1): 136 | chunks.append(self._apply_slice(waveform, sil_tags[i][1], sil_tags[i + 1][0])) 137 | if sil_tags[-1][1] < total_frames: 138 | chunks.append(self._apply_slice(waveform, sil_tags[-1][1], total_frames)) 139 | return chunks 140 | 141 | 142 | def main(): 143 | import os.path 144 | from argparse import ArgumentParser 145 | 146 | import librosa 147 | import soundfile 148 | 149 | parser = ArgumentParser() 150 | parser.add_argument('audio', type=str, help='The audio to be sliced') 151 | parser.add_argument('--out', type=str, help='Output directory of the sliced audio clips') 152 | parser.add_argument('--db_thresh', type=float, required=False, default=-40, 153 | help='The dB threshold for silence detection') 154 | parser.add_argument('--min_length', type=int, required=False, default=5000, 155 | help='The minimum milliseconds required for each sliced audio clip') 156 | parser.add_argument('--min_interval', type=int, required=False, default=300, 157 | help='The minimum milliseconds for a silence part to be sliced') 158 | parser.add_argument('--hop_size', type=int, required=False, default=10, 159 | help='Frame length in milliseconds') 160 | parser.add_argument('--max_sil_kept', type=int, required=False, default=500, 161 | help='The maximum silence length kept around the sliced clip, presented in milliseconds') 162 | args = parser.parse_args() 163 | out = args.out 164 | if out is None: 165 | out = os.path.dirname(os.path.abspath(args.audio)) 166 | audio, sr = librosa.load(args.audio, sr=None, mono=False) 167 | slicer = Slicer( 168 | sr=sr, 169 | threshold=args.db_thresh, 170 | min_length=args.min_length, 171 | min_interval=args.min_interval, 172 | hop_size=args.hop_size, 173 | max_sil_kept=args.max_sil_kept 174 | ) 175 | chunks = slicer.slice(audio) 176 | if not os.path.exists(out): 177 | os.makedirs(out) 178 | for i, chunk in enumerate(chunks): 179 | if len(chunk.shape) > 1: 180 | chunk = chunk.T 181 | soundfile.write(os.path.join(out, f'%s_%d.wav' % (os.path.basename(args.audio).rsplit('.', maxsplit=1)[0], i)), chunk, sr) 182 | 183 | 184 | if __name__ == '__main__': 185 | main() -------------------------------------------------------------------------------- /train/cmd.txt: -------------------------------------------------------------------------------- 1 | python train_nsf_sim_cache_sid.py -c configs/mi_mix40k_nsf_co256_cs1sid_ms2048.json -m ft-mi -------------------------------------------------------------------------------- /train/losses.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.nn import functional as F 3 | 4 | def feature_loss(fmap_r, fmap_g): 5 | loss = 0 6 | for dr, dg in zip(fmap_r, fmap_g): 7 | for rl, gl in zip(dr, dg): 8 | rl = rl.float().detach() 9 | gl = gl.float() 10 | loss += torch.mean(torch.abs(rl - gl)) 11 | 12 | return loss * 2 13 | 14 | 15 | def discriminator_loss(disc_real_outputs, disc_generated_outputs): 16 | loss = 0 17 | r_losses = [] 18 | g_losses = [] 19 | for dr, dg in zip(disc_real_outputs, disc_generated_outputs): 20 | dr = dr.float() 21 | dg = dg.float() 22 | r_loss = torch.mean((1 - dr) ** 2) 23 | g_loss = torch.mean(dg**2) 24 | loss += r_loss + g_loss 25 | r_losses.append(r_loss.item()) 26 | g_losses.append(g_loss.item()) 27 | 28 | return loss, r_losses, g_losses 29 | 30 
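# Editor's note: a tiny, hypothetical sanity check for the LSGAN-style discriminator loss
# above; it is not part of the repo. Real inputs are lists of per-sub-discriminator score
# tensors, so the values here are only for illustration. train/ is not a package, so the
# sketch assumes it is run from the repo root and puts the folder on sys.path first.
import sys, torch
sys.path.append("train")
from losses import discriminator_loss

real_scores = [torch.tensor([0.9, 1.1]), torch.tensor([1.0])]   # D(x): pushed toward 1
fake_scores = [torch.tensor([0.1, -0.2]), torch.tensor([0.0])]  # D(G(z)): pushed toward 0
loss, r_losses, g_losses = discriminator_loss(real_scores, fake_scores)
# loss sums mean((1 - D(x))^2) + mean(D(G(z))^2) over the sub-discriminators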
| 31 | def generator_loss(disc_outputs): 32 | loss = 0 33 | gen_losses = [] 34 | for dg in disc_outputs: 35 | dg = dg.float() 36 | l = torch.mean((1 - dg) ** 2) 37 | gen_losses.append(l) 38 | loss += l 39 | 40 | return loss, gen_losses 41 | 42 | 43 | def kl_loss(z_p, logs_q, m_p, logs_p, z_mask): 44 | """ 45 | z_p, logs_q: [b, h, t_t] 46 | m_p, logs_p: [b, h, t_t] 47 | """ 48 | z_p = z_p.float() 49 | logs_q = logs_q.float() 50 | m_p = m_p.float() 51 | logs_p = logs_p.float() 52 | z_mask = z_mask.float() 53 | 54 | kl = logs_p - logs_q - 0.5 55 | kl += 0.5 * ((z_p - m_p) ** 2) * torch.exp(-2.0 * logs_p) 56 | kl = torch.sum(kl * z_mask) 57 | l = kl / torch.sum(z_mask) 58 | return l 59 | -------------------------------------------------------------------------------- /train/mel_processing.py: -------------------------------------------------------------------------------- 1 | import math 2 | import os 3 | import random 4 | import torch 5 | from torch import nn 6 | import torch.nn.functional as F 7 | import torch.utils.data 8 | import numpy as np 9 | import librosa 10 | import librosa.util as librosa_util 11 | from librosa.util import normalize, pad_center, tiny 12 | from scipy.signal import get_window 13 | from scipy.io.wavfile import read 14 | from librosa.filters import mel as librosa_mel_fn 15 | 16 | MAX_WAV_VALUE = 32768.0 17 | 18 | 19 | def dynamic_range_compression_torch(x, C=1, clip_val=1e-5): 20 | """ 21 | PARAMS 22 | ------ 23 | C: compression factor 24 | """ 25 | return torch.log(torch.clamp(x, min=clip_val) * C) 26 | 27 | 28 | def dynamic_range_decompression_torch(x, C=1): 29 | """ 30 | PARAMS 31 | ------ 32 | C: compression factor used to compress 33 | """ 34 | return torch.exp(x) / C 35 | 36 | 37 | def spectral_normalize_torch(magnitudes): 38 | output = dynamic_range_compression_torch(magnitudes) 39 | return output 40 | 41 | 42 | def spectral_de_normalize_torch(magnitudes): 43 | output = dynamic_range_decompression_torch(magnitudes) 44 | return output 45 | 46 | 47 | mel_basis = {} 48 | hann_window = {} 49 | 50 | 51 | def spectrogram_torch(y, n_fft, sampling_rate, hop_size, win_size, center=False): 52 | if torch.min(y) < -1.0: 53 | print("min value is ", torch.min(y)) 54 | if torch.max(y) > 1.0: 55 | print("max value is ", torch.max(y)) 56 | 57 | global hann_window 58 | dtype_device = str(y.dtype) + "_" + str(y.device) 59 | wnsize_dtype_device = str(win_size) + "_" + dtype_device 60 | if wnsize_dtype_device not in hann_window: 61 | hann_window[wnsize_dtype_device] = torch.hann_window(win_size).to( 62 | dtype=y.dtype, device=y.device 63 | ) 64 | 65 | y = torch.nn.functional.pad( 66 | y.unsqueeze(1), 67 | (int((n_fft - hop_size) / 2), int((n_fft - hop_size) / 2)), 68 | mode="reflect", 69 | ) 70 | y = y.squeeze(1) 71 | 72 | spec = torch.stft( 73 | y, 74 | n_fft, 75 | hop_length=hop_size, 76 | win_length=win_size, 77 | window=hann_window[wnsize_dtype_device], 78 | center=center, 79 | pad_mode="reflect", 80 | normalized=False, 81 | onesided=True,return_complex=False 82 | ) 83 | 84 | spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6) 85 | return spec 86 | 87 | 88 | def spec_to_mel_torch(spec, n_fft, num_mels, sampling_rate, fmin, fmax): 89 | global mel_basis 90 | dtype_device = str(spec.dtype) + "_" + str(spec.device) 91 | fmax_dtype_device = str(fmax) + "_" + dtype_device 92 | if fmax_dtype_device not in mel_basis: 93 | mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax) 94 | mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to( 95 | dtype=spec.dtype, device=spec.device 96 | 
) 97 | spec = torch.matmul(mel_basis[fmax_dtype_device], spec) 98 | spec = spectral_normalize_torch(spec) 99 | return spec 100 | 101 | 102 | def mel_spectrogram_torch( 103 | y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False 104 | ): 105 | if torch.min(y) < -1.0: 106 | print("min value is ", torch.min(y)) 107 | if torch.max(y) > 1.0: 108 | print("max value is ", torch.max(y)) 109 | 110 | global mel_basis, hann_window 111 | dtype_device = str(y.dtype) + "_" + str(y.device) 112 | fmax_dtype_device = str(fmax) + "_" + dtype_device 113 | wnsize_dtype_device = str(win_size) + "_" + dtype_device 114 | if fmax_dtype_device not in mel_basis: 115 | mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax) 116 | mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to( 117 | dtype=y.dtype, device=y.device 118 | ) 119 | if wnsize_dtype_device not in hann_window: 120 | hann_window[wnsize_dtype_device] = torch.hann_window(win_size).to( 121 | dtype=y.dtype, device=y.device 122 | ) 123 | 124 | y = torch.nn.functional.pad( 125 | y.unsqueeze(1), 126 | (int((n_fft - hop_size) / 2), int((n_fft - hop_size) / 2)), 127 | mode="reflect", 128 | ) 129 | y = y.squeeze(1) 130 | 131 | # spec = torch.stft( 132 | # y, 133 | # n_fft, 134 | # hop_length=hop_size, 135 | # win_length=win_size, 136 | # window=hann_window[wnsize_dtype_device], 137 | # center=center, 138 | # pad_mode="reflect", 139 | # normalized=False, 140 | # onesided=True, 141 | # ) 142 | spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[wnsize_dtype_device], 143 | center=center, pad_mode='reflect', normalized=False, onesided=True, return_complex=False) 144 | spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6) 145 | 146 | spec = torch.matmul(mel_basis[fmax_dtype_device], spec) 147 | spec = spectral_normalize_torch(spec) 148 | 149 | return spec 150 | -------------------------------------------------------------------------------- /train/process_ckpt.py: -------------------------------------------------------------------------------- 1 | import torch,traceback,os,pdb 2 | from collections import OrderedDict 3 | 4 | def savee(ckpt,sr,if_f0,name,epoch): 5 | try: 6 | opt = OrderedDict() 7 | opt["weight"] = {} 8 | for key in ckpt.keys(): 9 | if ("enc_q" in key): continue 10 | opt["weight"][key] = ckpt[key].half() 11 | if(sr=="40k"):opt["config"] = [1025, 32, 192, 192, 768, 2, 6, 3, 0, "1", [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]], [10, 10, 2, 2], 512, [16, 16, 4, 4], 109, 256, 40000] 12 | elif(sr=="48k"):opt["config"] = [1025, 32, 192, 192, 768, 2, 6, 3, 0, "1", [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]], [10,6,2,2,2], 512, [16, 16, 4, 4,4], 109, 256, 48000] 13 | elif(sr=="32k"):opt["config"] = [513, 32, 192, 192, 768, 2, 6, 3, 0, "1", [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]], [10, 4, 2, 2, 2], 512, [16, 16, 4, 4,4], 109, 256, 32000] 14 | opt["info"] = "%sepoch"%epoch 15 | opt["sr"] = sr 16 | opt["f0"] =if_f0 17 | torch.save(opt, "weights/%s.pth"%name) 18 | return "Success." 
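# Editor's note: each opt["config"] list above packs, in order, the synthesizer constructor
# arguments used at inference time (spectrogram bins, training segment size, channel and
# attention sizes, resblock kernel/dilation settings, upsample rates and kernel sizes,
# speaker-embedding size, gin_channels, and the target sample rate). This reading is an
# informal inference from infer_pack/models.py, not an authoritative specification.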
19 | except: 20 | return traceback.format_exc() 21 | 22 | def show_info(path): 23 | try: 24 | a = torch.load(path, map_location="cpu") 25 | return "模型信息:%s\n采样率:%s\n模型是否输入音高引导:%s"%(a.get("info","None"),a.get("sr","None"),a.get("f0","None"),) 26 | except: 27 | return traceback.format_exc() 28 | 29 | def extract_small_model(path,name,sr,if_f0,info): 30 | try: 31 | ckpt = torch.load(path, map_location="cpu") 32 | if("model"in ckpt):ckpt=ckpt["model"] 33 | opt = OrderedDict() 34 | opt["weight"] = {} 35 | for key in ckpt.keys(): 36 | if ("enc_q" in key): continue 37 | opt["weight"][key] = ckpt[key].half() 38 | if(sr=="40k"):opt["config"] = [1025, 32, 192, 192, 768, 2, 6, 3, 0, "1", [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]], [10, 10, 2, 2], 512, [16, 16, 4, 4], 109, 256, 40000] 39 | elif(sr=="48k"):opt["config"] = [1025, 32, 192, 192, 768, 2, 6, 3, 0, "1", [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]], [10,6,2,2,2], 512, [16, 16, 4, 4,4], 109, 256, 48000] 40 | elif(sr=="32k"):opt["config"] = [513, 32, 192, 192, 768, 2, 6, 3, 0, "1", [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]], [10, 4, 2, 2, 2], 512, [16, 16, 4, 4,4], 109, 256, 32000] 41 | if(info==""):info="Extracted model." 42 | opt["info"] = info 43 | opt["sr"] = sr 44 | opt["f0"] =int(if_f0) 45 | torch.save(opt, "weights/%s.pth"%name) 46 | return "Success." 47 | except: 48 | return traceback.format_exc() 49 | 50 | def change_info(path,info,name): 51 | try: 52 | ckpt = torch.load(path, map_location="cpu") 53 | ckpt["info"]=info 54 | if(name==""):name=os.path.basename(path) 55 | torch.save(ckpt, "weights/%s"%name) 56 | return "Success." 57 | except: 58 | return traceback.format_exc() 59 | 60 | def merge(path1,path2,alpha1,sr,f0,info,name): 61 | try: 62 | def extract(ckpt): 63 | a = ckpt["model"] 64 | opt = OrderedDict() 65 | opt["weight"] = {} 66 | for key in a.keys(): 67 | if ("enc_q" in key): continue 68 | opt["weight"][key] = a[key] 69 | return opt 70 | ckpt1 = torch.load(path1, map_location="cpu") 71 | ckpt2 = torch.load(path2, map_location="cpu") 72 | cfg = ckpt1["config"] 73 | if("model"in ckpt1): ckpt1=extract(ckpt1) 74 | else: ckpt1=ckpt1["weight"] 75 | if("model"in ckpt2): ckpt2=extract(ckpt2) 76 | else: ckpt2=ckpt2["weight"] 77 | if(sorted(list(ckpt1.keys()))!=sorted(list(ckpt2.keys()))):return "Fail to merge the models. The model architectures are not the same." 
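# Editor's note: a hypothetical call to extract_small_model above (the path and name are
# placeholders, not files shipped with the repo). It drops the enc_q weights, halves the
# remaining tensors, and writes an inference-only checkpoint to weights/<name>.pth. The
# merge() routine that continues below blends two such checkpoints element-wise as
# alpha1 * w1 + (1 - alpha1) * w2.
import sys
sys.path.append("train")  # train/ is not a package; assumes the repo root as working dir
from process_ckpt import extract_small_model

msg = extract_small_model("logs/my-exp/G_2333.pth", "my-voice", "40k", 1, "Extracted model.")
print(msg)  # "Success." on success, otherwise the formatted traceback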
78 | opt = OrderedDict() 79 | opt["weight"] = {} 80 | for key in ckpt1.keys(): 81 | # try: 82 | if(key=="emb_g.weight"and ckpt1[key].shape!=ckpt2[key].shape): 83 | min_shape0=min(ckpt1[key].shape[0],ckpt2[key].shape[0]) 84 | opt["weight"][key] = (alpha1 * (ckpt1[key][:min_shape0].float()) + (1 - alpha1) * (ckpt2[key][:min_shape0].float())).half() 85 | else: 86 | opt["weight"][key] = (alpha1*(ckpt1[key].float())+(1-alpha1)*(ckpt2[key].float())).half() 87 | # except: 88 | # pdb.set_trace() 89 | opt["config"] = cfg 90 | ''' 91 | if(sr=="40k"):opt["config"] = [1025, 32, 192, 192, 768, 2, 6, 3, 0, "1", [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]], [10, 10, 2, 2], 512, [16, 16, 4, 4,4], 109, 256, 40000] 92 | elif(sr=="48k"):opt["config"] = [1025, 32, 192, 192, 768, 2, 6, 3, 0, "1", [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]], [10,6,2,2,2], 512, [16, 16, 4, 4], 109, 256, 48000] 93 | elif(sr=="32k"):opt["config"] = [513, 32, 192, 192, 768, 2, 6, 3, 0, "1", [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]], [10, 4, 2, 2, 2], 512, [16, 16, 4, 4,4], 109, 256, 32000] 94 | ''' 95 | opt["sr"]=sr 96 | opt["f0"]=1 if f0=="是"else 0 97 | opt["info"]=info 98 | torch.save(opt, "weights/%s.pth"%name) 99 | return "Success." 100 | except: 101 | return traceback.format_exc() 102 | -------------------------------------------------------------------------------- /train/utils.py: -------------------------------------------------------------------------------- 1 | import os,traceback 2 | import glob 3 | import sys 4 | import argparse 5 | import logging 6 | import json 7 | import subprocess 8 | import numpy as np 9 | from scipy.io.wavfile import read 10 | import torch 11 | 12 | MATPLOTLIB_FLAG = False 13 | 14 | logging.basicConfig(stream=sys.stdout, level=logging.DEBUG) 15 | logger = logging 16 | 17 | def load_checkpoint_d(checkpoint_path, combd,sbd, optimizer=None,load_opt=1): 18 | assert os.path.isfile(checkpoint_path) 19 | checkpoint_dict = torch.load(checkpoint_path, map_location='cpu') 20 | 21 | ################## 22 | def go(model,bkey): 23 | saved_state_dict = checkpoint_dict[bkey] 24 | if hasattr(model, 'module'):state_dict = model.module.state_dict() 25 | else:state_dict = model.state_dict() 26 | new_state_dict= {} 27 | for k, v in state_dict.items():#模型需要的shape 28 | try: 29 | new_state_dict[k] = saved_state_dict[k] 30 | if(saved_state_dict[k].shape!=state_dict[k].shape): 31 | print("shape-%s-mismatch|need-%s|get-%s"%(k,state_dict[k].shape,saved_state_dict[k].shape))# 32 | raise KeyError 33 | except: 34 | # logger.info(traceback.format_exc()) 35 | logger.info("%s is not in the checkpoint" % k)#pretrain缺失的 36 | new_state_dict[k] = v#模型自带的随机值 37 | if hasattr(model, 'module'): 38 | model.module.load_state_dict(new_state_dict,strict=False) 39 | else: 40 | model.load_state_dict(new_state_dict,strict=False) 41 | go(combd,"combd") 42 | go(sbd,"sbd") 43 | ############# 44 | logger.info("Loaded model weights") 45 | 46 | iteration = checkpoint_dict['iteration'] 47 | learning_rate = checkpoint_dict['learning_rate'] 48 | if optimizer is not None and load_opt==1:###加载不了,如果是空的的话,重新初始化,可能还会影响lr时间表的更新,因此在train文件最外围catch 49 | # try: 50 | optimizer.load_state_dict(checkpoint_dict['optimizer']) 51 | # except: 52 | # traceback.print_exc() 53 | logger.info("Loaded checkpoint '{}' (epoch {})" .format(checkpoint_path, iteration)) 54 | return model, optimizer, learning_rate, iteration 55 | 56 | 57 | # def load_checkpoint(checkpoint_path, model, optimizer=None): 58 | # assert os.path.isfile(checkpoint_path) 59 | # checkpoint_dict = 
torch.load(checkpoint_path, map_location='cpu') 60 | # iteration = checkpoint_dict['iteration'] 61 | # learning_rate = checkpoint_dict['learning_rate'] 62 | # if optimizer is not None: 63 | # optimizer.load_state_dict(checkpoint_dict['optimizer']) 64 | # # print(1111) 65 | # saved_state_dict = checkpoint_dict['model'] 66 | # # print(1111) 67 | # 68 | # if hasattr(model, 'module'): 69 | # state_dict = model.module.state_dict() 70 | # else: 71 | # state_dict = model.state_dict() 72 | # new_state_dict= {} 73 | # for k, v in state_dict.items(): 74 | # try: 75 | # new_state_dict[k] = saved_state_dict[k] 76 | # except: 77 | # logger.info("%s is not in the checkpoint" % k) 78 | # new_state_dict[k] = v 79 | # if hasattr(model, 'module'): 80 | # model.module.load_state_dict(new_state_dict) 81 | # else: 82 | # model.load_state_dict(new_state_dict) 83 | # logger.info("Loaded checkpoint '{}' (epoch {})" .format( 84 | # checkpoint_path, iteration)) 85 | # return model, optimizer, learning_rate, iteration 86 | def load_checkpoint(checkpoint_path, model, optimizer=None,load_opt=1): 87 | assert os.path.isfile(checkpoint_path) 88 | checkpoint_dict = torch.load(checkpoint_path, map_location='cpu') 89 | 90 | saved_state_dict = checkpoint_dict['model'] 91 | if hasattr(model, 'module'): 92 | state_dict = model.module.state_dict() 93 | else: 94 | state_dict = model.state_dict() 95 | new_state_dict= {} 96 | for k, v in state_dict.items():#模型需要的shape 97 | try: 98 | new_state_dict[k] = saved_state_dict[k] 99 | if(saved_state_dict[k].shape!=state_dict[k].shape): 100 | print("shape-%s-mismatch|need-%s|get-%s"%(k,state_dict[k].shape,saved_state_dict[k].shape))# 101 | raise KeyError 102 | except: 103 | # logger.info(traceback.format_exc()) 104 | logger.info("%s is not in the checkpoint" % k)#pretrain缺失的 105 | new_state_dict[k] = v#模型自带的随机值 106 | if hasattr(model, 'module'): 107 | model.module.load_state_dict(new_state_dict,strict=False) 108 | else: 109 | model.load_state_dict(new_state_dict,strict=False) 110 | logger.info("Loaded model weights") 111 | 112 | iteration = checkpoint_dict['iteration'] 113 | learning_rate = checkpoint_dict['learning_rate'] 114 | if optimizer is not None and load_opt==1:###加载不了,如果是空的的话,重新初始化,可能还会影响lr时间表的更新,因此在train文件最外围catch 115 | # try: 116 | optimizer.load_state_dict(checkpoint_dict['optimizer']) 117 | # except: 118 | # traceback.print_exc() 119 | logger.info("Loaded checkpoint '{}' (epoch {})" .format(checkpoint_path, iteration)) 120 | return model, optimizer, learning_rate, iteration 121 | 122 | 123 | def save_checkpoint(model, optimizer, learning_rate, iteration, checkpoint_path): 124 | logger.info("Saving model and optimizer state at epoch {} to {}".format( 125 | iteration, checkpoint_path)) 126 | if hasattr(model, 'module'): 127 | state_dict = model.module.state_dict() 128 | else: 129 | state_dict = model.state_dict() 130 | torch.save({'model': state_dict, 131 | 'iteration': iteration, 132 | 'optimizer': optimizer.state_dict(), 133 | 'learning_rate': learning_rate}, checkpoint_path) 134 | def save_checkpoint_d(combd, sbd, optimizer, learning_rate, iteration, checkpoint_path): 135 | logger.info("Saving model and optimizer state at epoch {} to {}".format( 136 | iteration, checkpoint_path)) 137 | if hasattr(combd, 'module'): state_dict_combd = combd.module.state_dict() 138 | else:state_dict_combd = combd.state_dict() 139 | if hasattr(sbd, 'module'): state_dict_sbd = sbd.module.state_dict() 140 | else:state_dict_sbd = sbd.state_dict() 141 | torch.save({ 142 | 'combd': state_dict_combd, 
143 | 'sbd': state_dict_sbd, 144 | 'iteration': iteration, 145 | 'optimizer': optimizer.state_dict(), 146 | 'learning_rate': learning_rate}, checkpoint_path) 147 | 148 | 149 | def summarize(writer, global_step, scalars={}, histograms={}, images={}, audios={}, audio_sampling_rate=22050): 150 | for k, v in scalars.items(): 151 | writer.add_scalar(k, v, global_step) 152 | for k, v in histograms.items(): 153 | writer.add_histogram(k, v, global_step) 154 | for k, v in images.items(): 155 | writer.add_image(k, v, global_step, dataformats='HWC') 156 | for k, v in audios.items(): 157 | writer.add_audio(k, v, global_step, audio_sampling_rate) 158 | 159 | 160 | def latest_checkpoint_path(dir_path, regex="G_*.pth"): 161 | f_list = glob.glob(os.path.join(dir_path, regex)) 162 | f_list.sort(key=lambda f: int("".join(filter(str.isdigit, f)))) 163 | x = f_list[-1] 164 | print(x) 165 | return x 166 | 167 | 168 | def plot_spectrogram_to_numpy(spectrogram): 169 | global MATPLOTLIB_FLAG 170 | if not MATPLOTLIB_FLAG: 171 | import matplotlib 172 | matplotlib.use("Agg") 173 | MATPLOTLIB_FLAG = True 174 | mpl_logger = logging.getLogger('matplotlib') 175 | mpl_logger.setLevel(logging.WARNING) 176 | import matplotlib.pylab as plt 177 | import numpy as np 178 | 179 | fig, ax = plt.subplots(figsize=(10,2)) 180 | im = ax.imshow(spectrogram, aspect="auto", origin="lower", 181 | interpolation='none') 182 | plt.colorbar(im, ax=ax) 183 | plt.xlabel("Frames") 184 | plt.ylabel("Channels") 185 | plt.tight_layout() 186 | 187 | fig.canvas.draw() 188 | data = np.fromstring(fig.canvas.tostring_rgb(), dtype=np.uint8, sep='') 189 | data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,)) 190 | plt.close() 191 | return data 192 | 193 | 194 | def plot_alignment_to_numpy(alignment, info=None): 195 | global MATPLOTLIB_FLAG 196 | if not MATPLOTLIB_FLAG: 197 | import matplotlib 198 | matplotlib.use("Agg") 199 | MATPLOTLIB_FLAG = True 200 | mpl_logger = logging.getLogger('matplotlib') 201 | mpl_logger.setLevel(logging.WARNING) 202 | import matplotlib.pylab as plt 203 | import numpy as np 204 | 205 | fig, ax = plt.subplots(figsize=(6, 4)) 206 | im = ax.imshow(alignment.transpose(), aspect='auto', origin='lower', 207 | interpolation='none') 208 | fig.colorbar(im, ax=ax) 209 | xlabel = 'Decoder timestep' 210 | if info is not None: 211 | xlabel += '\n\n' + info 212 | plt.xlabel(xlabel) 213 | plt.ylabel('Encoder timestep') 214 | plt.tight_layout() 215 | 216 | fig.canvas.draw() 217 | data = np.fromstring(fig.canvas.tostring_rgb(), dtype=np.uint8, sep='') 218 | data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,)) 219 | plt.close() 220 | return data 221 | 222 | 223 | def load_wav_to_torch(full_path): 224 | sampling_rate, data = read(full_path) 225 | return torch.FloatTensor(data.astype(np.float32)), sampling_rate 226 | 227 | 228 | def load_filepaths_and_text(filename, split="|"): 229 | with open(filename, encoding='utf-8') as f: 230 | filepaths_and_text = [line.strip().split(split) for line in f] 231 | return filepaths_and_text 232 | 233 | 234 | def get_hparams(init=True): 235 | ''' 236 | todo: 237 | 结尾七人组: 238 | 保存频率、总epoch done 239 | bs done 240 | pretrainG、pretrainD done 241 | 卡号:os.en["CUDA_VISIBLE_DEVICES"] done 242 | if_latest todo 243 | 模型:if_f0 todo 244 | 采样率:自动选择config done 245 | 是否缓存数据集进GPU:if_cache_data_in_gpu done 246 | 247 | -m: 248 | 自动决定training_files路径,改掉train_nsf_load_pretrain.py里的hps.data.training_files done 249 | -c不要了 250 | ''' 251 | parser = argparse.ArgumentParser() 252 | # parser.add_argument('-c', 
'--config', type=str, default="configs/40k.json",help='JSON file for configuration') 253 | parser.add_argument('-se', '--save_every_epoch', type=int, required=True,help='checkpoint save frequency (epoch)') 254 | parser.add_argument('-te', '--total_epoch', type=int, required=True,help='total_epoch') 255 | parser.add_argument('-pg', '--pretrainG', type=str, default="",help='Pretrained Discriminator path') 256 | parser.add_argument('-pd', '--pretrainD', type=str, default="",help='Pretrained Generator path') 257 | parser.add_argument('-g', '--gpus', type=str, default="0",help='split by -') 258 | parser.add_argument('-bs', '--batch_size', type=int, required=True,help='batch size') 259 | parser.add_argument('-e', '--experiment_dir', type=str, required=True,help='experiment dir')#-m 260 | parser.add_argument('-sr', '--sample_rate', type=str, required=True,help='sample rate, 32k/40k/48k') 261 | parser.add_argument('-f0', '--if_f0', type=int, required=True,help='use f0 as one of the inputs of the model, 1 or 0') 262 | parser.add_argument('-l', '--if_latest', type=int, required=True,help='if only save the latest G/D pth file, 1 or 0') 263 | parser.add_argument('-c', '--if_cache_data_in_gpu', type=int, required=True,help='if caching the dataset in GPU memory, 1 or 0') 264 | 265 | args = parser.parse_args() 266 | name = args.experiment_dir 267 | experiment_dir = os.path.join("./logs", args.experiment_dir) 268 | 269 | if not os.path.exists(experiment_dir): 270 | os.makedirs(experiment_dir) 271 | 272 | config_path = "configs/%s.json"%args.sample_rate 273 | config_save_path = os.path.join(experiment_dir, "config.json") 274 | if init: 275 | with open(config_path, "r") as f: 276 | data = f.read() 277 | with open(config_save_path, "w") as f: 278 | f.write(data) 279 | else: 280 | with open(config_save_path, "r") as f: 281 | data = f.read() 282 | config = json.loads(data) 283 | 284 | hparams = HParams(**config) 285 | hparams.model_dir = hparams.experiment_dir = experiment_dir 286 | hparams.save_every_epoch = args.save_every_epoch 287 | hparams.name = name 288 | hparams.total_epoch = args.total_epoch 289 | hparams.pretrainG = args.pretrainG 290 | hparams.pretrainD = args.pretrainD 291 | hparams.gpus = args.gpus 292 | hparams.train.batch_size = args.batch_size 293 | hparams.sample_rate = args.sample_rate 294 | hparams.if_f0 = args.if_f0 295 | hparams.if_latest = args.if_latest 296 | hparams.if_cache_data_in_gpu = args.if_cache_data_in_gpu 297 | hparams.data.training_files = "%s/filelist.txt"%experiment_dir 298 | return hparams 299 | 300 | 301 | def get_hparams_from_dir(model_dir): 302 | config_save_path = os.path.join(model_dir, "config.json") 303 | with open(config_save_path, "r") as f: 304 | data = f.read() 305 | config = json.loads(data) 306 | 307 | hparams =HParams(**config) 308 | hparams.model_dir = model_dir 309 | return hparams 310 | 311 | 312 | def get_hparams_from_file(config_path): 313 | with open(config_path, "r") as f: 314 | data = f.read() 315 | config = json.loads(data) 316 | 317 | hparams =HParams(**config) 318 | return hparams 319 | 320 | 321 | def check_git_hash(model_dir): 322 | source_dir = os.path.dirname(os.path.realpath(__file__)) 323 | if not os.path.exists(os.path.join(source_dir, ".git")): 324 | logger.warn("{} is not a git repository, therefore hash value comparison will be ignored.".format( 325 | source_dir 326 | )) 327 | return 328 | 329 | cur_hash = subprocess.getoutput("git rev-parse HEAD") 330 | 331 | path = os.path.join(model_dir, "githash") 332 | if os.path.exists(path): 333 | 
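# Editor's note: an illustrative training invocation matching the get_hparams() flags above
# (experiment name and checkpoint paths are placeholders). Despite the help texts, -pg takes
# the pretrained generator weights and -pd the pretrained discriminator weights.
#   python train_nsf_sim_cache_sid_load_pretrain.py -e my-exp -sr 40k -f0 1 -bs 4 -g 0 \
#       -te 20 -se 5 -pg pretrained/f0G40k.pth -pd pretrained/f0D40k.pth -l 0 -c 0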
saved_hash = open(path).read() 334 | if saved_hash != cur_hash: 335 | logger.warn("git hash values are different. {}(saved) != {}(current)".format( 336 | saved_hash[:8], cur_hash[:8])) 337 | else: 338 | open(path, "w").write(cur_hash) 339 | 340 | 341 | def get_logger(model_dir, filename="train.log"): 342 | global logger 343 | logger = logging.getLogger(os.path.basename(model_dir)) 344 | logger.setLevel(logging.DEBUG) 345 | 346 | formatter = logging.Formatter("%(asctime)s\t%(name)s\t%(levelname)s\t%(message)s") 347 | if not os.path.exists(model_dir): 348 | os.makedirs(model_dir) 349 | h = logging.FileHandler(os.path.join(model_dir, filename)) 350 | h.setLevel(logging.DEBUG) 351 | h.setFormatter(formatter) 352 | logger.addHandler(h) 353 | return logger 354 | 355 | 356 | class HParams(): 357 | def __init__(self, **kwargs): 358 | for k, v in kwargs.items(): 359 | if type(v) == dict: 360 | v = HParams(**v) 361 | self[k] = v 362 | 363 | def keys(self): 364 | return self.__dict__.keys() 365 | 366 | def items(self): 367 | return self.__dict__.items() 368 | 369 | def values(self): 370 | return self.__dict__.values() 371 | 372 | def __len__(self): 373 | return len(self.__dict__) 374 | 375 | def __getitem__(self, key): 376 | return getattr(self, key) 377 | 378 | def __setitem__(self, key, value): 379 | return setattr(self, key, value) 380 | 381 | def __contains__(self, key): 382 | return key in self.__dict__ 383 | 384 | def __repr__(self): 385 | return self.__dict__.__repr__() 386 | -------------------------------------------------------------------------------- /trainset_preprocess_pipeline_print.py: -------------------------------------------------------------------------------- 1 | import sys,os,multiprocessing 2 | now_dir=os.getcwd() 3 | sys.path.append(now_dir) 4 | 5 | inp_root = sys.argv[1] 6 | sr = int(sys.argv[2]) 7 | n_p = int(sys.argv[3]) 8 | exp_dir = sys.argv[4] 9 | noparallel = sys.argv[5] == "True" 10 | import numpy as np,os,traceback 11 | from slicer2 import Slicer 12 | import librosa,traceback 13 | from scipy.io import wavfile 14 | import multiprocessing 15 | from my_utils import load_audio 16 | 17 | mutex = multiprocessing.Lock() 18 | 19 | class PreProcess(): 20 | def __init__(self,sr,exp_dir): 21 | self.slicer = Slicer( 22 | sr=sr, 23 | threshold=-32, 24 | min_length=800, 25 | min_interval=400, 26 | hop_size=15, 27 | max_sil_kept=150 28 | ) 29 | self.sr=sr 30 | self.per=3.7 31 | self.overlap=0.3 32 | self.tail=self.per+self.overlap 33 | self.max=0.95 34 | self.alpha=0.8 35 | self.exp_dir=exp_dir 36 | self.gt_wavs_dir="%s/0_gt_wavs"%exp_dir 37 | self.wavs16k_dir="%s/1_16k_wavs"%exp_dir 38 | self.f = open("%s/preprocess.log"%exp_dir, "a+") 39 | os.makedirs(self.exp_dir,exist_ok=True) 40 | os.makedirs(self.gt_wavs_dir,exist_ok=True) 41 | os.makedirs(self.wavs16k_dir,exist_ok=True) 42 | 43 | def print(self, strr): 44 | mutex.acquire() 45 | print(strr) 46 | self.f.write("%s\n" % strr) 47 | self.f.flush() 48 | mutex.release() 49 | 50 | def norm_write(self,tmp_audio,idx0,idx1): 51 | tmp_audio = (tmp_audio / np.abs(tmp_audio).max() * (self.max * self.alpha)) + (1 - self.alpha) * tmp_audio 52 | wavfile.write("%s/%s_%s.wav" % (self.gt_wavs_dir, idx0, idx1), self.sr, (tmp_audio*32768).astype(np.int16)) 53 | tmp_audio = librosa.resample(tmp_audio, orig_sr=self.sr, target_sr=16000) 54 | wavfile.write("%s/%s_%s.wav" % (self.wavs16k_dir, idx0, idx1), 16000, (tmp_audio*32768).astype(np.int16)) 55 | 56 | def pipeline(self,path, idx0): 57 | try: 58 | audio = load_audio(path,self.sr) 59 | idx1=0 60 | 
for audio in self.slicer.slice(audio): 61 | i = 0 62 | while (1): 63 | start = int(self.sr * (self.per - self.overlap) * i) 64 | i += 1 65 | if (len(audio[start:]) > self.tail * self.sr): 66 | tmp_audio = audio[start:start + int(self.per * self.sr)] 67 | self.norm_write(tmp_audio,idx0,idx1) 68 | idx1 += 1 69 | else: 70 | tmp_audio = audio[start:] 71 | break 72 | self.norm_write(tmp_audio, idx0, idx1) 73 | self.print("%s->Suc."%path) 74 | except: 75 | self.print("%s->%s"%(path,traceback.format_exc())) 76 | 77 | def pipeline_mp(self,infos): 78 | for path, idx0 in infos: 79 | self.pipeline(path,idx0) 80 | 81 | def pipeline_mp_inp_dir(self,inp_root,n_p): 82 | try: 83 | infos = [("%s/%s" % (inp_root, name), idx) for idx, name in enumerate(sorted(list(os.listdir(inp_root))))] 84 | if noparallel: 85 | for i in range(n_p): self.pipeline_mp(infos[i::n_p]) 86 | else: 87 | ps=[] 88 | for i in range(n_p): 89 | p=multiprocessing.Process(target=self.pipeline_mp,args=(infos[i::n_p],)) 90 | p.start() 91 | ps.append(p) 92 | for p in ps:p.join() 93 | except: 94 | self.print("Fail. %s"%traceback.format_exc()) 95 | 96 | def preprocess_trainset(inp_root, sr, n_p, exp_dir): 97 | pp=PreProcess(sr,exp_dir) 98 | pp.print("start preprocess") 99 | pp.print(sys.argv) 100 | pp.pipeline_mp_inp_dir(inp_root,n_p) 101 | pp.print("end preprocess") 102 | 103 | if __name__=='__main__': 104 | preprocess_trainset(inp_root, sr, n_p, exp_dir) 105 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/dataset.py: -------------------------------------------------------------------------------- 1 | import os 2 | import random 3 | 4 | import numpy as np 5 | import torch 6 | import torch.utils.data 7 | from tqdm import tqdm 8 | 9 | from uvr5_pack.lib_v5 import spec_utils 10 | 11 | 12 | class VocalRemoverValidationSet(torch.utils.data.Dataset): 13 | 14 | def __init__(self, patch_list): 15 | self.patch_list = patch_list 16 | 17 | def __len__(self): 18 | return len(self.patch_list) 19 | 20 | def __getitem__(self, idx): 21 | path = self.patch_list[idx] 22 | data = np.load(path) 23 | 24 | X, y = data['X'], data['y'] 25 | 26 | X_mag = np.abs(X) 27 | y_mag = np.abs(y) 28 | 29 | return X_mag, y_mag 30 | 31 | 32 | def make_pair(mix_dir, inst_dir): 33 | input_exts = ['.wav', '.m4a', '.mp3', '.mp4', '.flac'] 34 | 35 | X_list = sorted([ 36 | os.path.join(mix_dir, fname) 37 | for fname in os.listdir(mix_dir) 38 | if os.path.splitext(fname)[1] in input_exts]) 39 | y_list = sorted([ 40 | os.path.join(inst_dir, fname) 41 | for fname in os.listdir(inst_dir) 42 | if os.path.splitext(fname)[1] in input_exts]) 43 | 44 | filelist = list(zip(X_list, y_list)) 45 | 46 | return filelist 47 | 48 | 49 | def train_val_split(dataset_dir, split_mode, val_rate, val_filelist): 50 | if split_mode == 'random': 51 | filelist = make_pair( 52 | os.path.join(dataset_dir, 'mixtures'), 53 | os.path.join(dataset_dir, 'instruments')) 54 | 55 | random.shuffle(filelist) 56 | 57 | if len(val_filelist) == 0: 58 | val_size = int(len(filelist) * val_rate) 59 | train_filelist = filelist[:-val_size] 60 | val_filelist = filelist[-val_size:] 61 | else: 62 | train_filelist = [ 63 | pair for pair in filelist 64 | if list(pair) not in val_filelist] 65 | elif split_mode == 'subdirs': 66 | if len(val_filelist) != 0: 67 | raise ValueError('The `val_filelist` option is not available in `subdirs` mode') 68 | 69 | train_filelist = make_pair( 70 | os.path.join(dataset_dir, 'training/mixtures'), 71 | os.path.join(dataset_dir, 'training/instruments')) 
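# Editor's note (refers to trainset_preprocess_pipeline_print.py above): the script is driven
# by positional sys.argv values only: input dir, target sample rate, number of worker
# processes, experiment/log dir, and the noparallel flag, e.g. (paths are placeholders):
#   python trainset_preprocess_pipeline_print.py /path/to/raw_vocals 40000 8 logs/my-exp False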
72 | 73 | val_filelist = make_pair( 74 | os.path.join(dataset_dir, 'validation/mixtures'), 75 | os.path.join(dataset_dir, 'validation/instruments')) 76 | 77 | return train_filelist, val_filelist 78 | 79 | 80 | def augment(X, y, reduction_rate, reduction_mask, mixup_rate, mixup_alpha): 81 | perm = np.random.permutation(len(X)) 82 | for i, idx in enumerate(tqdm(perm)): 83 | if np.random.uniform() < reduction_rate: 84 | y[idx] = spec_utils.reduce_vocal_aggressively(X[idx], y[idx], reduction_mask) 85 | 86 | if np.random.uniform() < 0.5: 87 | # swap channel 88 | X[idx] = X[idx, ::-1] 89 | y[idx] = y[idx, ::-1] 90 | if np.random.uniform() < 0.02: 91 | # mono 92 | X[idx] = X[idx].mean(axis=0, keepdims=True) 93 | y[idx] = y[idx].mean(axis=0, keepdims=True) 94 | if np.random.uniform() < 0.02: 95 | # inst 96 | X[idx] = y[idx] 97 | 98 | if np.random.uniform() < mixup_rate and i < len(perm) - 1: 99 | lam = np.random.beta(mixup_alpha, mixup_alpha) 100 | X[idx] = lam * X[idx] + (1 - lam) * X[perm[i + 1]] 101 | y[idx] = lam * y[idx] + (1 - lam) * y[perm[i + 1]] 102 | 103 | return X, y 104 | 105 | 106 | def make_padding(width, cropsize, offset): 107 | left = offset 108 | roi_size = cropsize - left * 2 109 | if roi_size == 0: 110 | roi_size = cropsize 111 | right = roi_size - (width % roi_size) + left 112 | 113 | return left, right, roi_size 114 | 115 | 116 | def make_training_set(filelist, cropsize, patches, sr, hop_length, n_fft, offset): 117 | len_dataset = patches * len(filelist) 118 | 119 | X_dataset = np.zeros( 120 | (len_dataset, 2, n_fft // 2 + 1, cropsize), dtype=np.complex64) 121 | y_dataset = np.zeros( 122 | (len_dataset, 2, n_fft // 2 + 1, cropsize), dtype=np.complex64) 123 | 124 | for i, (X_path, y_path) in enumerate(tqdm(filelist)): 125 | X, y = spec_utils.cache_or_load(X_path, y_path, sr, hop_length, n_fft) 126 | coef = np.max([np.abs(X).max(), np.abs(y).max()]) 127 | X, y = X / coef, y / coef 128 | 129 | l, r, roi_size = make_padding(X.shape[2], cropsize, offset) 130 | X_pad = np.pad(X, ((0, 0), (0, 0), (l, r)), mode='constant') 131 | y_pad = np.pad(y, ((0, 0), (0, 0), (l, r)), mode='constant') 132 | 133 | starts = np.random.randint(0, X_pad.shape[2] - cropsize, patches) 134 | ends = starts + cropsize 135 | for j in range(patches): 136 | idx = i * patches + j 137 | X_dataset[idx] = X_pad[:, :, starts[j]:ends[j]] 138 | y_dataset[idx] = y_pad[:, :, starts[j]:ends[j]] 139 | 140 | return X_dataset, y_dataset 141 | 142 | 143 | def make_validation_set(filelist, cropsize, sr, hop_length, n_fft, offset): 144 | patch_list = [] 145 | patch_dir = 'cs{}_sr{}_hl{}_nf{}_of{}'.format(cropsize, sr, hop_length, n_fft, offset) 146 | os.makedirs(patch_dir, exist_ok=True) 147 | 148 | for i, (X_path, y_path) in enumerate(tqdm(filelist)): 149 | basename = os.path.splitext(os.path.basename(X_path))[0] 150 | 151 | X, y = spec_utils.cache_or_load(X_path, y_path, sr, hop_length, n_fft) 152 | coef = np.max([np.abs(X).max(), np.abs(y).max()]) 153 | X, y = X / coef, y / coef 154 | 155 | l, r, roi_size = make_padding(X.shape[2], cropsize, offset) 156 | X_pad = np.pad(X, ((0, 0), (0, 0), (l, r)), mode='constant') 157 | y_pad = np.pad(y, ((0, 0), (0, 0), (l, r)), mode='constant') 158 | 159 | len_dataset = int(np.ceil(X.shape[2] / roi_size)) 160 | for j in range(len_dataset): 161 | outpath = os.path.join(patch_dir, '{}_p{}.npz'.format(basename, j)) 162 | start = j * roi_size 163 | if not os.path.exists(outpath): 164 | np.savez( 165 | outpath, 166 | X=X_pad[:, :, start:start + cropsize], 167 | y=y_pad[:, :, start:start + 
cropsize]) 168 | patch_list.append(outpath) 169 | 170 | return VocalRemoverValidationSet(patch_list) 171 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/layers.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | import torch.nn.functional as F 4 | 5 | from uvr5_pack.lib_v5 import spec_utils 6 | 7 | 8 | class Conv2DBNActiv(nn.Module): 9 | 10 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU): 11 | super(Conv2DBNActiv, self).__init__() 12 | self.conv = nn.Sequential( 13 | nn.Conv2d( 14 | nin, nout, 15 | kernel_size=ksize, 16 | stride=stride, 17 | padding=pad, 18 | dilation=dilation, 19 | bias=False), 20 | nn.BatchNorm2d(nout), 21 | activ() 22 | ) 23 | 24 | def __call__(self, x): 25 | return self.conv(x) 26 | 27 | 28 | class SeperableConv2DBNActiv(nn.Module): 29 | 30 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU): 31 | super(SeperableConv2DBNActiv, self).__init__() 32 | self.conv = nn.Sequential( 33 | nn.Conv2d( 34 | nin, nin, 35 | kernel_size=ksize, 36 | stride=stride, 37 | padding=pad, 38 | dilation=dilation, 39 | groups=nin, 40 | bias=False), 41 | nn.Conv2d( 42 | nin, nout, 43 | kernel_size=1, 44 | bias=False), 45 | nn.BatchNorm2d(nout), 46 | activ() 47 | ) 48 | 49 | def __call__(self, x): 50 | return self.conv(x) 51 | 52 | 53 | class Encoder(nn.Module): 54 | 55 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.LeakyReLU): 56 | super(Encoder, self).__init__() 57 | self.conv1 = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ) 58 | self.conv2 = Conv2DBNActiv(nout, nout, ksize, stride, pad, activ=activ) 59 | 60 | def __call__(self, x): 61 | skip = self.conv1(x) 62 | h = self.conv2(skip) 63 | 64 | return h, skip 65 | 66 | 67 | class Decoder(nn.Module): 68 | 69 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.ReLU, dropout=False): 70 | super(Decoder, self).__init__() 71 | self.conv = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ) 72 | self.dropout = nn.Dropout2d(0.1) if dropout else None 73 | 74 | def __call__(self, x, skip=None): 75 | x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=True) 76 | if skip is not None: 77 | skip = spec_utils.crop_center(skip, x) 78 | x = torch.cat([x, skip], dim=1) 79 | h = self.conv(x) 80 | 81 | if self.dropout is not None: 82 | h = self.dropout(h) 83 | 84 | return h 85 | 86 | 87 | class ASPPModule(nn.Module): 88 | 89 | def __init__(self, nin, nout, dilations=(4, 8, 16), activ=nn.ReLU): 90 | super(ASPPModule, self).__init__() 91 | self.conv1 = nn.Sequential( 92 | nn.AdaptiveAvgPool2d((1, None)), 93 | Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ) 94 | ) 95 | self.conv2 = Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ) 96 | self.conv3 = SeperableConv2DBNActiv( 97 | nin, nin, 3, 1, dilations[0], dilations[0], activ=activ) 98 | self.conv4 = SeperableConv2DBNActiv( 99 | nin, nin, 3, 1, dilations[1], dilations[1], activ=activ) 100 | self.conv5 = SeperableConv2DBNActiv( 101 | nin, nin, 3, 1, dilations[2], dilations[2], activ=activ) 102 | self.bottleneck = nn.Sequential( 103 | Conv2DBNActiv(nin * 5, nout, 1, 1, 0, activ=activ), 104 | nn.Dropout2d(0.1) 105 | ) 106 | 107 | def forward(self, x): 108 | _, _, h, w = x.size() 109 | feat1 = F.interpolate(self.conv1(x), size=(h, w), mode='bilinear', align_corners=True) 110 | feat2 = self.conv2(x) 111 | feat3 = self.conv3(x) 112 | feat4 = self.conv4(x) 113 | feat5 = 
self.conv5(x) 114 | out = torch.cat((feat1, feat2, feat3, feat4, feat5), dim=1) 115 | bottle = self.bottleneck(out) 116 | return bottle 117 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/layers_123812KB .py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | import torch.nn.functional as F 4 | 5 | from uvr5_pack.lib_v5 import spec_utils 6 | 7 | 8 | class Conv2DBNActiv(nn.Module): 9 | 10 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU): 11 | super(Conv2DBNActiv, self).__init__() 12 | self.conv = nn.Sequential( 13 | nn.Conv2d( 14 | nin, nout, 15 | kernel_size=ksize, 16 | stride=stride, 17 | padding=pad, 18 | dilation=dilation, 19 | bias=False), 20 | nn.BatchNorm2d(nout), 21 | activ() 22 | ) 23 | 24 | def __call__(self, x): 25 | return self.conv(x) 26 | 27 | 28 | class SeperableConv2DBNActiv(nn.Module): 29 | 30 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU): 31 | super(SeperableConv2DBNActiv, self).__init__() 32 | self.conv = nn.Sequential( 33 | nn.Conv2d( 34 | nin, nin, 35 | kernel_size=ksize, 36 | stride=stride, 37 | padding=pad, 38 | dilation=dilation, 39 | groups=nin, 40 | bias=False), 41 | nn.Conv2d( 42 | nin, nout, 43 | kernel_size=1, 44 | bias=False), 45 | nn.BatchNorm2d(nout), 46 | activ() 47 | ) 48 | 49 | def __call__(self, x): 50 | return self.conv(x) 51 | 52 | 53 | class Encoder(nn.Module): 54 | 55 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.LeakyReLU): 56 | super(Encoder, self).__init__() 57 | self.conv1 = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ) 58 | self.conv2 = Conv2DBNActiv(nout, nout, ksize, stride, pad, activ=activ) 59 | 60 | def __call__(self, x): 61 | skip = self.conv1(x) 62 | h = self.conv2(skip) 63 | 64 | return h, skip 65 | 66 | 67 | class Decoder(nn.Module): 68 | 69 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.ReLU, dropout=False): 70 | super(Decoder, self).__init__() 71 | self.conv = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ) 72 | self.dropout = nn.Dropout2d(0.1) if dropout else None 73 | 74 | def __call__(self, x, skip=None): 75 | x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=True) 76 | if skip is not None: 77 | skip = spec_utils.crop_center(skip, x) 78 | x = torch.cat([x, skip], dim=1) 79 | h = self.conv(x) 80 | 81 | if self.dropout is not None: 82 | h = self.dropout(h) 83 | 84 | return h 85 | 86 | 87 | class ASPPModule(nn.Module): 88 | 89 | def __init__(self, nin, nout, dilations=(4, 8, 16), activ=nn.ReLU): 90 | super(ASPPModule, self).__init__() 91 | self.conv1 = nn.Sequential( 92 | nn.AdaptiveAvgPool2d((1, None)), 93 | Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ) 94 | ) 95 | self.conv2 = Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ) 96 | self.conv3 = SeperableConv2DBNActiv( 97 | nin, nin, 3, 1, dilations[0], dilations[0], activ=activ) 98 | self.conv4 = SeperableConv2DBNActiv( 99 | nin, nin, 3, 1, dilations[1], dilations[1], activ=activ) 100 | self.conv5 = SeperableConv2DBNActiv( 101 | nin, nin, 3, 1, dilations[2], dilations[2], activ=activ) 102 | self.bottleneck = nn.Sequential( 103 | Conv2DBNActiv(nin * 5, nout, 1, 1, 0, activ=activ), 104 | nn.Dropout2d(0.1) 105 | ) 106 | 107 | def forward(self, x): 108 | _, _, h, w = x.size() 109 | feat1 = F.interpolate(self.conv1(x), size=(h, w), mode='bilinear', align_corners=True) 110 | feat2 = self.conv2(x) 111 | feat3 = self.conv3(x) 
112 | feat4 = self.conv4(x) 113 | feat5 = self.conv5(x) 114 | out = torch.cat((feat1, feat2, feat3, feat4, feat5), dim=1) 115 | bottle = self.bottleneck(out) 116 | return bottle 117 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/layers_123821KB.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | import torch.nn.functional as F 4 | 5 | from uvr5_pack.lib_v5 import spec_utils 6 | 7 | 8 | class Conv2DBNActiv(nn.Module): 9 | 10 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU): 11 | super(Conv2DBNActiv, self).__init__() 12 | self.conv = nn.Sequential( 13 | nn.Conv2d( 14 | nin, nout, 15 | kernel_size=ksize, 16 | stride=stride, 17 | padding=pad, 18 | dilation=dilation, 19 | bias=False), 20 | nn.BatchNorm2d(nout), 21 | activ() 22 | ) 23 | 24 | def __call__(self, x): 25 | return self.conv(x) 26 | 27 | 28 | class SeperableConv2DBNActiv(nn.Module): 29 | 30 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU): 31 | super(SeperableConv2DBNActiv, self).__init__() 32 | self.conv = nn.Sequential( 33 | nn.Conv2d( 34 | nin, nin, 35 | kernel_size=ksize, 36 | stride=stride, 37 | padding=pad, 38 | dilation=dilation, 39 | groups=nin, 40 | bias=False), 41 | nn.Conv2d( 42 | nin, nout, 43 | kernel_size=1, 44 | bias=False), 45 | nn.BatchNorm2d(nout), 46 | activ() 47 | ) 48 | 49 | def __call__(self, x): 50 | return self.conv(x) 51 | 52 | 53 | class Encoder(nn.Module): 54 | 55 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.LeakyReLU): 56 | super(Encoder, self).__init__() 57 | self.conv1 = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ) 58 | self.conv2 = Conv2DBNActiv(nout, nout, ksize, stride, pad, activ=activ) 59 | 60 | def __call__(self, x): 61 | skip = self.conv1(x) 62 | h = self.conv2(skip) 63 | 64 | return h, skip 65 | 66 | 67 | class Decoder(nn.Module): 68 | 69 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.ReLU, dropout=False): 70 | super(Decoder, self).__init__() 71 | self.conv = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ) 72 | self.dropout = nn.Dropout2d(0.1) if dropout else None 73 | 74 | def __call__(self, x, skip=None): 75 | x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=True) 76 | if skip is not None: 77 | skip = spec_utils.crop_center(skip, x) 78 | x = torch.cat([x, skip], dim=1) 79 | h = self.conv(x) 80 | 81 | if self.dropout is not None: 82 | h = self.dropout(h) 83 | 84 | return h 85 | 86 | 87 | class ASPPModule(nn.Module): 88 | 89 | def __init__(self, nin, nout, dilations=(4, 8, 16), activ=nn.ReLU): 90 | super(ASPPModule, self).__init__() 91 | self.conv1 = nn.Sequential( 92 | nn.AdaptiveAvgPool2d((1, None)), 93 | Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ) 94 | ) 95 | self.conv2 = Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ) 96 | self.conv3 = SeperableConv2DBNActiv( 97 | nin, nin, 3, 1, dilations[0], dilations[0], activ=activ) 98 | self.conv4 = SeperableConv2DBNActiv( 99 | nin, nin, 3, 1, dilations[1], dilations[1], activ=activ) 100 | self.conv5 = SeperableConv2DBNActiv( 101 | nin, nin, 3, 1, dilations[2], dilations[2], activ=activ) 102 | self.bottleneck = nn.Sequential( 103 | Conv2DBNActiv(nin * 5, nout, 1, 1, 0, activ=activ), 104 | nn.Dropout2d(0.1) 105 | ) 106 | 107 | def forward(self, x): 108 | _, _, h, w = x.size() 109 | feat1 = F.interpolate(self.conv1(x), size=(h, w), mode='bilinear', align_corners=True) 110 | feat2 = 
self.conv2(x) 111 | feat3 = self.conv3(x) 112 | feat4 = self.conv4(x) 113 | feat5 = self.conv5(x) 114 | out = torch.cat((feat1, feat2, feat3, feat4, feat5), dim=1) 115 | bottle = self.bottleneck(out) 116 | return bottle 117 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/layers_33966KB.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | import torch.nn.functional as F 4 | 5 | from uvr5_pack.lib_v5 import spec_utils 6 | 7 | 8 | class Conv2DBNActiv(nn.Module): 9 | 10 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU): 11 | super(Conv2DBNActiv, self).__init__() 12 | self.conv = nn.Sequential( 13 | nn.Conv2d( 14 | nin, nout, 15 | kernel_size=ksize, 16 | stride=stride, 17 | padding=pad, 18 | dilation=dilation, 19 | bias=False), 20 | nn.BatchNorm2d(nout), 21 | activ() 22 | ) 23 | 24 | def __call__(self, x): 25 | return self.conv(x) 26 | 27 | 28 | class SeperableConv2DBNActiv(nn.Module): 29 | 30 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU): 31 | super(SeperableConv2DBNActiv, self).__init__() 32 | self.conv = nn.Sequential( 33 | nn.Conv2d( 34 | nin, nin, 35 | kernel_size=ksize, 36 | stride=stride, 37 | padding=pad, 38 | dilation=dilation, 39 | groups=nin, 40 | bias=False), 41 | nn.Conv2d( 42 | nin, nout, 43 | kernel_size=1, 44 | bias=False), 45 | nn.BatchNorm2d(nout), 46 | activ() 47 | ) 48 | 49 | def __call__(self, x): 50 | return self.conv(x) 51 | 52 | 53 | class Encoder(nn.Module): 54 | 55 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.LeakyReLU): 56 | super(Encoder, self).__init__() 57 | self.conv1 = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ) 58 | self.conv2 = Conv2DBNActiv(nout, nout, ksize, stride, pad, activ=activ) 59 | 60 | def __call__(self, x): 61 | skip = self.conv1(x) 62 | h = self.conv2(skip) 63 | 64 | return h, skip 65 | 66 | 67 | class Decoder(nn.Module): 68 | 69 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.ReLU, dropout=False): 70 | super(Decoder, self).__init__() 71 | self.conv = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ) 72 | self.dropout = nn.Dropout2d(0.1) if dropout else None 73 | 74 | def __call__(self, x, skip=None): 75 | x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=True) 76 | if skip is not None: 77 | skip = spec_utils.crop_center(skip, x) 78 | x = torch.cat([x, skip], dim=1) 79 | h = self.conv(x) 80 | 81 | if self.dropout is not None: 82 | h = self.dropout(h) 83 | 84 | return h 85 | 86 | 87 | class ASPPModule(nn.Module): 88 | 89 | def __init__(self, nin, nout, dilations=(4, 8, 16, 32, 64), activ=nn.ReLU): 90 | super(ASPPModule, self).__init__() 91 | self.conv1 = nn.Sequential( 92 | nn.AdaptiveAvgPool2d((1, None)), 93 | Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ) 94 | ) 95 | self.conv2 = Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ) 96 | self.conv3 = SeperableConv2DBNActiv( 97 | nin, nin, 3, 1, dilations[0], dilations[0], activ=activ) 98 | self.conv4 = SeperableConv2DBNActiv( 99 | nin, nin, 3, 1, dilations[1], dilations[1], activ=activ) 100 | self.conv5 = SeperableConv2DBNActiv( 101 | nin, nin, 3, 1, dilations[2], dilations[2], activ=activ) 102 | self.conv6 = SeperableConv2DBNActiv( 103 | nin, nin, 3, 1, dilations[2], dilations[2], activ=activ) 104 | self.conv7 = SeperableConv2DBNActiv( 105 | nin, nin, 3, 1, dilations[2], dilations[2], activ=activ) 106 | self.bottleneck = nn.Sequential( 
107 | Conv2DBNActiv(nin * 7, nout, 1, 1, 0, activ=activ), 108 | nn.Dropout2d(0.1) 109 | ) 110 | 111 | def forward(self, x): 112 | _, _, h, w = x.size() 113 | feat1 = F.interpolate(self.conv1(x), size=(h, w), mode='bilinear', align_corners=True) 114 | feat2 = self.conv2(x) 115 | feat3 = self.conv3(x) 116 | feat4 = self.conv4(x) 117 | feat5 = self.conv5(x) 118 | feat6 = self.conv6(x) 119 | feat7 = self.conv7(x) 120 | out = torch.cat((feat1, feat2, feat3, feat4, feat5, feat6, feat7), dim=1) 121 | bottle = self.bottleneck(out) 122 | return bottle 123 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/layers_537227KB.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | import torch.nn.functional as F 4 | 5 | from uvr5_pack.lib_v5 import spec_utils 6 | 7 | 8 | class Conv2DBNActiv(nn.Module): 9 | 10 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU): 11 | super(Conv2DBNActiv, self).__init__() 12 | self.conv = nn.Sequential( 13 | nn.Conv2d( 14 | nin, nout, 15 | kernel_size=ksize, 16 | stride=stride, 17 | padding=pad, 18 | dilation=dilation, 19 | bias=False), 20 | nn.BatchNorm2d(nout), 21 | activ() 22 | ) 23 | 24 | def __call__(self, x): 25 | return self.conv(x) 26 | 27 | 28 | class SeperableConv2DBNActiv(nn.Module): 29 | 30 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU): 31 | super(SeperableConv2DBNActiv, self).__init__() 32 | self.conv = nn.Sequential( 33 | nn.Conv2d( 34 | nin, nin, 35 | kernel_size=ksize, 36 | stride=stride, 37 | padding=pad, 38 | dilation=dilation, 39 | groups=nin, 40 | bias=False), 41 | nn.Conv2d( 42 | nin, nout, 43 | kernel_size=1, 44 | bias=False), 45 | nn.BatchNorm2d(nout), 46 | activ() 47 | ) 48 | 49 | def __call__(self, x): 50 | return self.conv(x) 51 | 52 | 53 | class Encoder(nn.Module): 54 | 55 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.LeakyReLU): 56 | super(Encoder, self).__init__() 57 | self.conv1 = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ) 58 | self.conv2 = Conv2DBNActiv(nout, nout, ksize, stride, pad, activ=activ) 59 | 60 | def __call__(self, x): 61 | skip = self.conv1(x) 62 | h = self.conv2(skip) 63 | 64 | return h, skip 65 | 66 | 67 | class Decoder(nn.Module): 68 | 69 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.ReLU, dropout=False): 70 | super(Decoder, self).__init__() 71 | self.conv = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ) 72 | self.dropout = nn.Dropout2d(0.1) if dropout else None 73 | 74 | def __call__(self, x, skip=None): 75 | x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=True) 76 | if skip is not None: 77 | skip = spec_utils.crop_center(skip, x) 78 | x = torch.cat([x, skip], dim=1) 79 | h = self.conv(x) 80 | 81 | if self.dropout is not None: 82 | h = self.dropout(h) 83 | 84 | return h 85 | 86 | 87 | class ASPPModule(nn.Module): 88 | 89 | def __init__(self, nin, nout, dilations=(4, 8, 16, 32, 64), activ=nn.ReLU): 90 | super(ASPPModule, self).__init__() 91 | self.conv1 = nn.Sequential( 92 | nn.AdaptiveAvgPool2d((1, None)), 93 | Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ) 94 | ) 95 | self.conv2 = Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ) 96 | self.conv3 = SeperableConv2DBNActiv( 97 | nin, nin, 3, 1, dilations[0], dilations[0], activ=activ) 98 | self.conv4 = SeperableConv2DBNActiv( 99 | nin, nin, 3, 1, dilations[1], dilations[1], activ=activ) 100 | self.conv5 = 
SeperableConv2DBNActiv( 101 | nin, nin, 3, 1, dilations[2], dilations[2], activ=activ) 102 | self.conv6 = SeperableConv2DBNActiv( 103 | nin, nin, 3, 1, dilations[2], dilations[2], activ=activ) 104 | self.conv7 = SeperableConv2DBNActiv( 105 | nin, nin, 3, 1, dilations[2], dilations[2], activ=activ) 106 | self.bottleneck = nn.Sequential( 107 | Conv2DBNActiv(nin * 7, nout, 1, 1, 0, activ=activ), 108 | nn.Dropout2d(0.1) 109 | ) 110 | 111 | def forward(self, x): 112 | _, _, h, w = x.size() 113 | feat1 = F.interpolate(self.conv1(x), size=(h, w), mode='bilinear', align_corners=True) 114 | feat2 = self.conv2(x) 115 | feat3 = self.conv3(x) 116 | feat4 = self.conv4(x) 117 | feat5 = self.conv5(x) 118 | feat6 = self.conv6(x) 119 | feat7 = self.conv7(x) 120 | out = torch.cat((feat1, feat2, feat3, feat4, feat5, feat6, feat7), dim=1) 121 | bottle = self.bottleneck(out) 122 | return bottle 123 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/layers_537238KB.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | import torch.nn.functional as F 4 | 5 | from uvr5_pack.lib_v5 import spec_utils 6 | 7 | 8 | class Conv2DBNActiv(nn.Module): 9 | 10 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU): 11 | super(Conv2DBNActiv, self).__init__() 12 | self.conv = nn.Sequential( 13 | nn.Conv2d( 14 | nin, nout, 15 | kernel_size=ksize, 16 | stride=stride, 17 | padding=pad, 18 | dilation=dilation, 19 | bias=False), 20 | nn.BatchNorm2d(nout), 21 | activ() 22 | ) 23 | 24 | def __call__(self, x): 25 | return self.conv(x) 26 | 27 | 28 | class SeperableConv2DBNActiv(nn.Module): 29 | 30 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU): 31 | super(SeperableConv2DBNActiv, self).__init__() 32 | self.conv = nn.Sequential( 33 | nn.Conv2d( 34 | nin, nin, 35 | kernel_size=ksize, 36 | stride=stride, 37 | padding=pad, 38 | dilation=dilation, 39 | groups=nin, 40 | bias=False), 41 | nn.Conv2d( 42 | nin, nout, 43 | kernel_size=1, 44 | bias=False), 45 | nn.BatchNorm2d(nout), 46 | activ() 47 | ) 48 | 49 | def __call__(self, x): 50 | return self.conv(x) 51 | 52 | 53 | class Encoder(nn.Module): 54 | 55 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.LeakyReLU): 56 | super(Encoder, self).__init__() 57 | self.conv1 = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ) 58 | self.conv2 = Conv2DBNActiv(nout, nout, ksize, stride, pad, activ=activ) 59 | 60 | def __call__(self, x): 61 | skip = self.conv1(x) 62 | h = self.conv2(skip) 63 | 64 | return h, skip 65 | 66 | 67 | class Decoder(nn.Module): 68 | 69 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.ReLU, dropout=False): 70 | super(Decoder, self).__init__() 71 | self.conv = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ) 72 | self.dropout = nn.Dropout2d(0.1) if dropout else None 73 | 74 | def __call__(self, x, skip=None): 75 | x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=True) 76 | if skip is not None: 77 | skip = spec_utils.crop_center(skip, x) 78 | x = torch.cat([x, skip], dim=1) 79 | h = self.conv(x) 80 | 81 | if self.dropout is not None: 82 | h = self.dropout(h) 83 | 84 | return h 85 | 86 | 87 | class ASPPModule(nn.Module): 88 | 89 | def __init__(self, nin, nout, dilations=(4, 8, 16, 32, 64), activ=nn.ReLU): 90 | super(ASPPModule, self).__init__() 91 | self.conv1 = nn.Sequential( 92 | nn.AdaptiveAvgPool2d((1, None)), 93 | 
Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ) 94 | ) 95 | self.conv2 = Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ) 96 | self.conv3 = SeperableConv2DBNActiv( 97 | nin, nin, 3, 1, dilations[0], dilations[0], activ=activ) 98 | self.conv4 = SeperableConv2DBNActiv( 99 | nin, nin, 3, 1, dilations[1], dilations[1], activ=activ) 100 | self.conv5 = SeperableConv2DBNActiv( 101 | nin, nin, 3, 1, dilations[2], dilations[2], activ=activ) 102 | self.conv6 = SeperableConv2DBNActiv( 103 | nin, nin, 3, 1, dilations[2], dilations[2], activ=activ) 104 | self.conv7 = SeperableConv2DBNActiv( 105 | nin, nin, 3, 1, dilations[2], dilations[2], activ=activ) 106 | self.bottleneck = nn.Sequential( 107 | Conv2DBNActiv(nin * 7, nout, 1, 1, 0, activ=activ), 108 | nn.Dropout2d(0.1) 109 | ) 110 | 111 | def forward(self, x): 112 | _, _, h, w = x.size() 113 | feat1 = F.interpolate(self.conv1(x), size=(h, w), mode='bilinear', align_corners=True) 114 | feat2 = self.conv2(x) 115 | feat3 = self.conv3(x) 116 | feat4 = self.conv4(x) 117 | feat5 = self.conv5(x) 118 | feat6 = self.conv6(x) 119 | feat7 = self.conv7(x) 120 | out = torch.cat((feat1, feat2, feat3, feat4, feat5, feat6, feat7), dim=1) 121 | bottle = self.bottleneck(out) 122 | return bottle 123 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/model_param_init.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import pathlib 4 | 5 | default_param = {} 6 | default_param['bins'] = 768 7 | default_param['unstable_bins'] = 9 # training only 8 | default_param['reduction_bins'] = 762 # training only 9 | default_param['sr'] = 44100 10 | default_param['pre_filter_start'] = 757 11 | default_param['pre_filter_stop'] = 768 12 | default_param['band'] = {} 13 | 14 | 15 | default_param['band'][1] = { 16 | 'sr': 11025, 17 | 'hl': 128, 18 | 'n_fft': 960, 19 | 'crop_start': 0, 20 | 'crop_stop': 245, 21 | 'lpf_start': 61, # inference only 22 | 'res_type': 'polyphase' 23 | } 24 | 25 | default_param['band'][2] = { 26 | 'sr': 44100, 27 | 'hl': 512, 28 | 'n_fft': 1536, 29 | 'crop_start': 24, 30 | 'crop_stop': 547, 31 | 'hpf_start': 81, # inference only 32 | 'res_type': 'sinc_best' 33 | } 34 | 35 | 36 | def int_keys(d): 37 | r = {} 38 | for k, v in d: 39 | if k.isdigit(): 40 | k = int(k) 41 | r[k] = v 42 | return r 43 | 44 | 45 | class ModelParameters(object): 46 | def __init__(self, config_path=''): 47 | if '.pth' == pathlib.Path(config_path).suffix: 48 | import zipfile 49 | 50 | with zipfile.ZipFile(config_path, 'r') as zip: 51 | self.param = json.loads(zip.read('param.json'), object_pairs_hook=int_keys) 52 | elif '.json' == pathlib.Path(config_path).suffix: 53 | with open(config_path, 'r') as f: 54 | self.param = json.loads(f.read(), object_pairs_hook=int_keys) 55 | else: 56 | self.param = default_param 57 | 58 | for k in ['mid_side', 'mid_side_b', 'mid_side_b2', 'stereo_w', 'stereo_n', 'reverse']: 59 | if not k in self.param: 60 | self.param[k] = False -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/1band_sr16000_hl512.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 1024, 3 | "unstable_bins": 0, 4 | "reduction_bins": 0, 5 | "band": { 6 | "1": { 7 | "sr": 16000, 8 | "hl": 512, 9 | "n_fft": 2048, 10 | "crop_start": 0, 11 | "crop_stop": 1024, 12 | "hpf_start": -1, 13 | "res_type": "sinc_best" 14 | } 15 | }, 16 | "sr": 16000, 17 | "pre_filter_start": 1023, 
18 | "pre_filter_stop": 1024 19 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/1band_sr32000_hl512.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 1024, 3 | "unstable_bins": 0, 4 | "reduction_bins": 0, 5 | "band": { 6 | "1": { 7 | "sr": 32000, 8 | "hl": 512, 9 | "n_fft": 2048, 10 | "crop_start": 0, 11 | "crop_stop": 1024, 12 | "hpf_start": -1, 13 | "res_type": "kaiser_fast" 14 | } 15 | }, 16 | "sr": 32000, 17 | "pre_filter_start": 1000, 18 | "pre_filter_stop": 1021 19 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/1band_sr33075_hl384.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 1024, 3 | "unstable_bins": 0, 4 | "reduction_bins": 0, 5 | "band": { 6 | "1": { 7 | "sr": 33075, 8 | "hl": 384, 9 | "n_fft": 2048, 10 | "crop_start": 0, 11 | "crop_stop": 1024, 12 | "hpf_start": -1, 13 | "res_type": "sinc_best" 14 | } 15 | }, 16 | "sr": 33075, 17 | "pre_filter_start": 1000, 18 | "pre_filter_stop": 1021 19 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/1band_sr44100_hl1024.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 1024, 3 | "unstable_bins": 0, 4 | "reduction_bins": 0, 5 | "band": { 6 | "1": { 7 | "sr": 44100, 8 | "hl": 1024, 9 | "n_fft": 2048, 10 | "crop_start": 0, 11 | "crop_stop": 1024, 12 | "hpf_start": -1, 13 | "res_type": "sinc_best" 14 | } 15 | }, 16 | "sr": 44100, 17 | "pre_filter_start": 1023, 18 | "pre_filter_stop": 1024 19 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/1band_sr44100_hl256.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 256, 3 | "unstable_bins": 0, 4 | "reduction_bins": 0, 5 | "band": { 6 | "1": { 7 | "sr": 44100, 8 | "hl": 256, 9 | "n_fft": 512, 10 | "crop_start": 0, 11 | "crop_stop": 256, 12 | "hpf_start": -1, 13 | "res_type": "sinc_best" 14 | } 15 | }, 16 | "sr": 44100, 17 | "pre_filter_start": 256, 18 | "pre_filter_stop": 256 19 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/1band_sr44100_hl512.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 1024, 3 | "unstable_bins": 0, 4 | "reduction_bins": 0, 5 | "band": { 6 | "1": { 7 | "sr": 44100, 8 | "hl": 512, 9 | "n_fft": 2048, 10 | "crop_start": 0, 11 | "crop_stop": 1024, 12 | "hpf_start": -1, 13 | "res_type": "sinc_best" 14 | } 15 | }, 16 | "sr": 44100, 17 | "pre_filter_start": 1023, 18 | "pre_filter_stop": 1024 19 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/1band_sr44100_hl512_cut.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 1024, 3 | "unstable_bins": 0, 4 | "reduction_bins": 0, 5 | "band": { 6 | "1": { 7 | "sr": 44100, 8 | "hl": 512, 9 | "n_fft": 2048, 10 | "crop_start": 0, 11 | "crop_stop": 700, 12 | "hpf_start": -1, 13 | "res_type": "sinc_best" 14 | } 15 | }, 16 | "sr": 44100, 17 | "pre_filter_start": 1023, 18 | "pre_filter_stop": 700 19 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/2band_32000.json: 
-------------------------------------------------------------------------------- 1 | { 2 | "bins": 768, 3 | "unstable_bins": 7, 4 | "reduction_bins": 705, 5 | "band": { 6 | "1": { 7 | "sr": 6000, 8 | "hl": 66, 9 | "n_fft": 512, 10 | "crop_start": 0, 11 | "crop_stop": 240, 12 | "lpf_start": 60, 13 | "lpf_stop": 118, 14 | "res_type": "sinc_fastest" 15 | }, 16 | "2": { 17 | "sr": 32000, 18 | "hl": 352, 19 | "n_fft": 1024, 20 | "crop_start": 22, 21 | "crop_stop": 505, 22 | "hpf_start": 44, 23 | "hpf_stop": 23, 24 | "res_type": "sinc_medium" 25 | } 26 | }, 27 | "sr": 32000, 28 | "pre_filter_start": 710, 29 | "pre_filter_stop": 731 30 | } 31 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/2band_44100_lofi.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 512, 3 | "unstable_bins": 7, 4 | "reduction_bins": 510, 5 | "band": { 6 | "1": { 7 | "sr": 11025, 8 | "hl": 160, 9 | "n_fft": 768, 10 | "crop_start": 0, 11 | "crop_stop": 192, 12 | "lpf_start": 41, 13 | "lpf_stop": 139, 14 | "res_type": "sinc_fastest" 15 | }, 16 | "2": { 17 | "sr": 44100, 18 | "hl": 640, 19 | "n_fft": 1024, 20 | "crop_start": 10, 21 | "crop_stop": 320, 22 | "hpf_start": 47, 23 | "hpf_stop": 15, 24 | "res_type": "sinc_medium" 25 | } 26 | }, 27 | "sr": 44100, 28 | "pre_filter_start": 510, 29 | "pre_filter_stop": 512 30 | } 31 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/2band_48000.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 768, 3 | "unstable_bins": 7, 4 | "reduction_bins": 705, 5 | "band": { 6 | "1": { 7 | "sr": 6000, 8 | "hl": 66, 9 | "n_fft": 512, 10 | "crop_start": 0, 11 | "crop_stop": 240, 12 | "lpf_start": 60, 13 | "lpf_stop": 240, 14 | "res_type": "sinc_fastest" 15 | }, 16 | "2": { 17 | "sr": 48000, 18 | "hl": 528, 19 | "n_fft": 1536, 20 | "crop_start": 22, 21 | "crop_stop": 505, 22 | "hpf_start": 82, 23 | "hpf_stop": 22, 24 | "res_type": "sinc_medium" 25 | } 26 | }, 27 | "sr": 48000, 28 | "pre_filter_start": 710, 29 | "pre_filter_stop": 731 30 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/3band_44100.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 768, 3 | "unstable_bins": 5, 4 | "reduction_bins": 733, 5 | "band": { 6 | "1": { 7 | "sr": 11025, 8 | "hl": 128, 9 | "n_fft": 768, 10 | "crop_start": 0, 11 | "crop_stop": 278, 12 | "lpf_start": 28, 13 | "lpf_stop": 140, 14 | "res_type": "polyphase" 15 | }, 16 | "2": { 17 | "sr": 22050, 18 | "hl": 256, 19 | "n_fft": 768, 20 | "crop_start": 14, 21 | "crop_stop": 322, 22 | "hpf_start": 70, 23 | "hpf_stop": 14, 24 | "lpf_start": 283, 25 | "lpf_stop": 314, 26 | "res_type": "polyphase" 27 | }, 28 | "3": { 29 | "sr": 44100, 30 | "hl": 512, 31 | "n_fft": 768, 32 | "crop_start": 131, 33 | "crop_stop": 313, 34 | "hpf_start": 154, 35 | "hpf_stop": 141, 36 | "res_type": "sinc_medium" 37 | } 38 | }, 39 | "sr": 44100, 40 | "pre_filter_start": 757, 41 | "pre_filter_stop": 768 42 | } 43 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/3band_44100_mid.json: -------------------------------------------------------------------------------- 1 | { 2 | "mid_side": true, 3 | "bins": 768, 4 | "unstable_bins": 5, 5 | "reduction_bins": 733, 6 | "band": { 7 | "1": { 8 | "sr": 
11025, 9 | "hl": 128, 10 | "n_fft": 768, 11 | "crop_start": 0, 12 | "crop_stop": 278, 13 | "lpf_start": 28, 14 | "lpf_stop": 140, 15 | "res_type": "polyphase" 16 | }, 17 | "2": { 18 | "sr": 22050, 19 | "hl": 256, 20 | "n_fft": 768, 21 | "crop_start": 14, 22 | "crop_stop": 322, 23 | "hpf_start": 70, 24 | "hpf_stop": 14, 25 | "lpf_start": 283, 26 | "lpf_stop": 314, 27 | "res_type": "polyphase" 28 | }, 29 | "3": { 30 | "sr": 44100, 31 | "hl": 512, 32 | "n_fft": 768, 33 | "crop_start": 131, 34 | "crop_stop": 313, 35 | "hpf_start": 154, 36 | "hpf_stop": 141, 37 | "res_type": "sinc_medium" 38 | } 39 | }, 40 | "sr": 44100, 41 | "pre_filter_start": 757, 42 | "pre_filter_stop": 768 43 | } 44 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/3band_44100_msb2.json: -------------------------------------------------------------------------------- 1 | { 2 | "mid_side_b2": true, 3 | "bins": 640, 4 | "unstable_bins": 7, 5 | "reduction_bins": 565, 6 | "band": { 7 | "1": { 8 | "sr": 11025, 9 | "hl": 108, 10 | "n_fft": 1024, 11 | "crop_start": 0, 12 | "crop_stop": 187, 13 | "lpf_start": 92, 14 | "lpf_stop": 186, 15 | "res_type": "polyphase" 16 | }, 17 | "2": { 18 | "sr": 22050, 19 | "hl": 216, 20 | "n_fft": 768, 21 | "crop_start": 0, 22 | "crop_stop": 212, 23 | "hpf_start": 68, 24 | "hpf_stop": 34, 25 | "lpf_start": 174, 26 | "lpf_stop": 209, 27 | "res_type": "polyphase" 28 | }, 29 | "3": { 30 | "sr": 44100, 31 | "hl": 432, 32 | "n_fft": 640, 33 | "crop_start": 66, 34 | "crop_stop": 307, 35 | "hpf_start": 86, 36 | "hpf_stop": 72, 37 | "res_type": "kaiser_fast" 38 | } 39 | }, 40 | "sr": 44100, 41 | "pre_filter_start": 639, 42 | "pre_filter_stop": 640 43 | } 44 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/4band_44100.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 768, 3 | "unstable_bins": 7, 4 | "reduction_bins": 668, 5 | "band": { 6 | "1": { 7 | "sr": 11025, 8 | "hl": 128, 9 | "n_fft": 1024, 10 | "crop_start": 0, 11 | "crop_stop": 186, 12 | "lpf_start": 37, 13 | "lpf_stop": 73, 14 | "res_type": "polyphase" 15 | }, 16 | "2": { 17 | "sr": 11025, 18 | "hl": 128, 19 | "n_fft": 512, 20 | "crop_start": 4, 21 | "crop_stop": 185, 22 | "hpf_start": 36, 23 | "hpf_stop": 18, 24 | "lpf_start": 93, 25 | "lpf_stop": 185, 26 | "res_type": "polyphase" 27 | }, 28 | "3": { 29 | "sr": 22050, 30 | "hl": 256, 31 | "n_fft": 512, 32 | "crop_start": 46, 33 | "crop_stop": 186, 34 | "hpf_start": 93, 35 | "hpf_stop": 46, 36 | "lpf_start": 164, 37 | "lpf_stop": 186, 38 | "res_type": "polyphase" 39 | }, 40 | "4": { 41 | "sr": 44100, 42 | "hl": 512, 43 | "n_fft": 768, 44 | "crop_start": 121, 45 | "crop_stop": 382, 46 | "hpf_start": 138, 47 | "hpf_stop": 123, 48 | "res_type": "sinc_medium" 49 | } 50 | }, 51 | "sr": 44100, 52 | "pre_filter_start": 740, 53 | "pre_filter_stop": 768 54 | } 55 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/4band_44100_mid.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 768, 3 | "unstable_bins": 7, 4 | "mid_side": true, 5 | "reduction_bins": 668, 6 | "band": { 7 | "1": { 8 | "sr": 11025, 9 | "hl": 128, 10 | "n_fft": 1024, 11 | "crop_start": 0, 12 | "crop_stop": 186, 13 | "lpf_start": 37, 14 | "lpf_stop": 73, 15 | "res_type": "polyphase" 16 | }, 17 | "2": { 18 | "sr": 11025, 19 | "hl": 128, 20 | "n_fft": 
512, 21 | "crop_start": 4, 22 | "crop_stop": 185, 23 | "hpf_start": 36, 24 | "hpf_stop": 18, 25 | "lpf_start": 93, 26 | "lpf_stop": 185, 27 | "res_type": "polyphase" 28 | }, 29 | "3": { 30 | "sr": 22050, 31 | "hl": 256, 32 | "n_fft": 512, 33 | "crop_start": 46, 34 | "crop_stop": 186, 35 | "hpf_start": 93, 36 | "hpf_stop": 46, 37 | "lpf_start": 164, 38 | "lpf_stop": 186, 39 | "res_type": "polyphase" 40 | }, 41 | "4": { 42 | "sr": 44100, 43 | "hl": 512, 44 | "n_fft": 768, 45 | "crop_start": 121, 46 | "crop_stop": 382, 47 | "hpf_start": 138, 48 | "hpf_stop": 123, 49 | "res_type": "sinc_medium" 50 | } 51 | }, 52 | "sr": 44100, 53 | "pre_filter_start": 740, 54 | "pre_filter_stop": 768 55 | } 56 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/4band_44100_msb.json: -------------------------------------------------------------------------------- 1 | { 2 | "mid_side_b": true, 3 | "bins": 768, 4 | "unstable_bins": 7, 5 | "reduction_bins": 668, 6 | "band": { 7 | "1": { 8 | "sr": 11025, 9 | "hl": 128, 10 | "n_fft": 1024, 11 | "crop_start": 0, 12 | "crop_stop": 186, 13 | "lpf_start": 37, 14 | "lpf_stop": 73, 15 | "res_type": "polyphase" 16 | }, 17 | "2": { 18 | "sr": 11025, 19 | "hl": 128, 20 | "n_fft": 512, 21 | "crop_start": 4, 22 | "crop_stop": 185, 23 | "hpf_start": 36, 24 | "hpf_stop": 18, 25 | "lpf_start": 93, 26 | "lpf_stop": 185, 27 | "res_type": "polyphase" 28 | }, 29 | "3": { 30 | "sr": 22050, 31 | "hl": 256, 32 | "n_fft": 512, 33 | "crop_start": 46, 34 | "crop_stop": 186, 35 | "hpf_start": 93, 36 | "hpf_stop": 46, 37 | "lpf_start": 164, 38 | "lpf_stop": 186, 39 | "res_type": "polyphase" 40 | }, 41 | "4": { 42 | "sr": 44100, 43 | "hl": 512, 44 | "n_fft": 768, 45 | "crop_start": 121, 46 | "crop_stop": 382, 47 | "hpf_start": 138, 48 | "hpf_stop": 123, 49 | "res_type": "sinc_medium" 50 | } 51 | }, 52 | "sr": 44100, 53 | "pre_filter_start": 740, 54 | "pre_filter_stop": 768 55 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/4band_44100_msb2.json: -------------------------------------------------------------------------------- 1 | { 2 | "mid_side_b": true, 3 | "bins": 768, 4 | "unstable_bins": 7, 5 | "reduction_bins": 668, 6 | "band": { 7 | "1": { 8 | "sr": 11025, 9 | "hl": 128, 10 | "n_fft": 1024, 11 | "crop_start": 0, 12 | "crop_stop": 186, 13 | "lpf_start": 37, 14 | "lpf_stop": 73, 15 | "res_type": "polyphase" 16 | }, 17 | "2": { 18 | "sr": 11025, 19 | "hl": 128, 20 | "n_fft": 512, 21 | "crop_start": 4, 22 | "crop_stop": 185, 23 | "hpf_start": 36, 24 | "hpf_stop": 18, 25 | "lpf_start": 93, 26 | "lpf_stop": 185, 27 | "res_type": "polyphase" 28 | }, 29 | "3": { 30 | "sr": 22050, 31 | "hl": 256, 32 | "n_fft": 512, 33 | "crop_start": 46, 34 | "crop_stop": 186, 35 | "hpf_start": 93, 36 | "hpf_stop": 46, 37 | "lpf_start": 164, 38 | "lpf_stop": 186, 39 | "res_type": "polyphase" 40 | }, 41 | "4": { 42 | "sr": 44100, 43 | "hl": 512, 44 | "n_fft": 768, 45 | "crop_start": 121, 46 | "crop_stop": 382, 47 | "hpf_start": 138, 48 | "hpf_stop": 123, 49 | "res_type": "sinc_medium" 50 | } 51 | }, 52 | "sr": 44100, 53 | "pre_filter_start": 740, 54 | "pre_filter_stop": 768 55 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/4band_44100_reverse.json: -------------------------------------------------------------------------------- 1 | { 2 | "reverse": true, 3 | "bins": 768, 4 | "unstable_bins": 7, 5 | "reduction_bins": 668, 
6 | "band": { 7 | "1": { 8 | "sr": 11025, 9 | "hl": 128, 10 | "n_fft": 1024, 11 | "crop_start": 0, 12 | "crop_stop": 186, 13 | "lpf_start": 37, 14 | "lpf_stop": 73, 15 | "res_type": "polyphase" 16 | }, 17 | "2": { 18 | "sr": 11025, 19 | "hl": 128, 20 | "n_fft": 512, 21 | "crop_start": 4, 22 | "crop_stop": 185, 23 | "hpf_start": 36, 24 | "hpf_stop": 18, 25 | "lpf_start": 93, 26 | "lpf_stop": 185, 27 | "res_type": "polyphase" 28 | }, 29 | "3": { 30 | "sr": 22050, 31 | "hl": 256, 32 | "n_fft": 512, 33 | "crop_start": 46, 34 | "crop_stop": 186, 35 | "hpf_start": 93, 36 | "hpf_stop": 46, 37 | "lpf_start": 164, 38 | "lpf_stop": 186, 39 | "res_type": "polyphase" 40 | }, 41 | "4": { 42 | "sr": 44100, 43 | "hl": 512, 44 | "n_fft": 768, 45 | "crop_start": 121, 46 | "crop_stop": 382, 47 | "hpf_start": 138, 48 | "hpf_stop": 123, 49 | "res_type": "sinc_medium" 50 | } 51 | }, 52 | "sr": 44100, 53 | "pre_filter_start": 740, 54 | "pre_filter_stop": 768 55 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/4band_44100_sw.json: -------------------------------------------------------------------------------- 1 | { 2 | "stereo_w": true, 3 | "bins": 768, 4 | "unstable_bins": 7, 5 | "reduction_bins": 668, 6 | "band": { 7 | "1": { 8 | "sr": 11025, 9 | "hl": 128, 10 | "n_fft": 1024, 11 | "crop_start": 0, 12 | "crop_stop": 186, 13 | "lpf_start": 37, 14 | "lpf_stop": 73, 15 | "res_type": "polyphase" 16 | }, 17 | "2": { 18 | "sr": 11025, 19 | "hl": 128, 20 | "n_fft": 512, 21 | "crop_start": 4, 22 | "crop_stop": 185, 23 | "hpf_start": 36, 24 | "hpf_stop": 18, 25 | "lpf_start": 93, 26 | "lpf_stop": 185, 27 | "res_type": "polyphase" 28 | }, 29 | "3": { 30 | "sr": 22050, 31 | "hl": 256, 32 | "n_fft": 512, 33 | "crop_start": 46, 34 | "crop_stop": 186, 35 | "hpf_start": 93, 36 | "hpf_stop": 46, 37 | "lpf_start": 164, 38 | "lpf_stop": 186, 39 | "res_type": "polyphase" 40 | }, 41 | "4": { 42 | "sr": 44100, 43 | "hl": 512, 44 | "n_fft": 768, 45 | "crop_start": 121, 46 | "crop_stop": 382, 47 | "hpf_start": 138, 48 | "hpf_stop": 123, 49 | "res_type": "sinc_medium" 50 | } 51 | }, 52 | "sr": 44100, 53 | "pre_filter_start": 740, 54 | "pre_filter_stop": 768 55 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/4band_v2.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 672, 3 | "unstable_bins": 8, 4 | "reduction_bins": 637, 5 | "band": { 6 | "1": { 7 | "sr": 7350, 8 | "hl": 80, 9 | "n_fft": 640, 10 | "crop_start": 0, 11 | "crop_stop": 85, 12 | "lpf_start": 25, 13 | "lpf_stop": 53, 14 | "res_type": "polyphase" 15 | }, 16 | "2": { 17 | "sr": 7350, 18 | "hl": 80, 19 | "n_fft": 320, 20 | "crop_start": 4, 21 | "crop_stop": 87, 22 | "hpf_start": 25, 23 | "hpf_stop": 12, 24 | "lpf_start": 31, 25 | "lpf_stop": 62, 26 | "res_type": "polyphase" 27 | }, 28 | "3": { 29 | "sr": 14700, 30 | "hl": 160, 31 | "n_fft": 512, 32 | "crop_start": 17, 33 | "crop_stop": 216, 34 | "hpf_start": 48, 35 | "hpf_stop": 24, 36 | "lpf_start": 139, 37 | "lpf_stop": 210, 38 | "res_type": "polyphase" 39 | }, 40 | "4": { 41 | "sr": 44100, 42 | "hl": 480, 43 | "n_fft": 960, 44 | "crop_start": 78, 45 | "crop_stop": 383, 46 | "hpf_start": 130, 47 | "hpf_stop": 86, 48 | "res_type": "kaiser_fast" 49 | } 50 | }, 51 | "sr": 44100, 52 | "pre_filter_start": 668, 53 | "pre_filter_stop": 672 54 | } -------------------------------------------------------------------------------- 
/uvr5_pack/lib_v5/modelparams/4band_v2_sn.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 672, 3 | "unstable_bins": 8, 4 | "reduction_bins": 637, 5 | "band": { 6 | "1": { 7 | "sr": 7350, 8 | "hl": 80, 9 | "n_fft": 640, 10 | "crop_start": 0, 11 | "crop_stop": 85, 12 | "lpf_start": 25, 13 | "lpf_stop": 53, 14 | "res_type": "polyphase" 15 | }, 16 | "2": { 17 | "sr": 7350, 18 | "hl": 80, 19 | "n_fft": 320, 20 | "crop_start": 4, 21 | "crop_stop": 87, 22 | "hpf_start": 25, 23 | "hpf_stop": 12, 24 | "lpf_start": 31, 25 | "lpf_stop": 62, 26 | "res_type": "polyphase" 27 | }, 28 | "3": { 29 | "sr": 14700, 30 | "hl": 160, 31 | "n_fft": 512, 32 | "crop_start": 17, 33 | "crop_stop": 216, 34 | "hpf_start": 48, 35 | "hpf_stop": 24, 36 | "lpf_start": 139, 37 | "lpf_stop": 210, 38 | "res_type": "polyphase" 39 | }, 40 | "4": { 41 | "sr": 44100, 42 | "hl": 480, 43 | "n_fft": 960, 44 | "crop_start": 78, 45 | "crop_stop": 383, 46 | "hpf_start": 130, 47 | "hpf_stop": 86, 48 | "convert_channels": "stereo_n", 49 | "res_type": "kaiser_fast" 50 | } 51 | }, 52 | "sr": 44100, 53 | "pre_filter_start": 668, 54 | "pre_filter_stop": 672 55 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/ensemble.json: -------------------------------------------------------------------------------- 1 | { 2 | "mid_side_b2": true, 3 | "bins": 1280, 4 | "unstable_bins": 7, 5 | "reduction_bins": 565, 6 | "band": { 7 | "1": { 8 | "sr": 11025, 9 | "hl": 108, 10 | "n_fft": 2048, 11 | "crop_start": 0, 12 | "crop_stop": 374, 13 | "lpf_start": 92, 14 | "lpf_stop": 186, 15 | "res_type": "polyphase" 16 | }, 17 | "2": { 18 | "sr": 22050, 19 | "hl": 216, 20 | "n_fft": 1536, 21 | "crop_start": 0, 22 | "crop_stop": 424, 23 | "hpf_start": 68, 24 | "hpf_stop": 34, 25 | "lpf_start": 348, 26 | "lpf_stop": 418, 27 | "res_type": "polyphase" 28 | }, 29 | "3": { 30 | "sr": 44100, 31 | "hl": 432, 32 | "n_fft": 1280, 33 | "crop_start": 132, 34 | "crop_stop": 614, 35 | "hpf_start": 172, 36 | "hpf_stop": 144, 37 | "res_type": "polyphase" 38 | } 39 | }, 40 | "sr": 44100, 41 | "pre_filter_start": 1280, 42 | "pre_filter_stop": 1280 43 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/nets.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | import torch.nn.functional as F 4 | 5 | from uvr5_pack.lib_v5 import layers 6 | from uvr5_pack.lib_v5 import spec_utils 7 | 8 | 9 | class BaseASPPNet(nn.Module): 10 | 11 | def __init__(self, nin, ch, dilations=(4, 8, 16)): 12 | super(BaseASPPNet, self).__init__() 13 | self.enc1 = layers.Encoder(nin, ch, 3, 2, 1) 14 | self.enc2 = layers.Encoder(ch, ch * 2, 3, 2, 1) 15 | self.enc3 = layers.Encoder(ch * 2, ch * 4, 3, 2, 1) 16 | self.enc4 = layers.Encoder(ch * 4, ch * 8, 3, 2, 1) 17 | 18 | self.aspp = layers.ASPPModule(ch * 8, ch * 16, dilations) 19 | 20 | self.dec4 = layers.Decoder(ch * (8 + 16), ch * 8, 3, 1, 1) 21 | self.dec3 = layers.Decoder(ch * (4 + 8), ch * 4, 3, 1, 1) 22 | self.dec2 = layers.Decoder(ch * (2 + 4), ch * 2, 3, 1, 1) 23 | self.dec1 = layers.Decoder(ch * (1 + 2), ch, 3, 1, 1) 24 | 25 | def __call__(self, x): 26 | h, e1 = self.enc1(x) 27 | h, e2 = self.enc2(h) 28 | h, e3 = self.enc3(h) 29 | h, e4 = self.enc4(h) 30 | 31 | h = self.aspp(h) 32 | 33 | h = self.dec4(h, e4) 34 | h = self.dec3(h, e3) 35 | h = self.dec2(h, e2) 36 | h = self.dec1(h, e1) 37 | 38 | return h 39 | 
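# BaseASPPNet above is a small U-Net: four strided Encoder stages, an
# ASPPModule bottleneck, and four Decoder stages fed by the encoder skip
# connections. CascadedASPPNet below chains three of them: stage 1 runs
# separate low-band and high-band nets on the two frequency halves of the
# input spectrogram, stages 2 and 3 refine the full band from the input
# concatenated with the earlier stage outputs, and the final sigmoid produces
# a magnitude mask that is multiplied back onto the detached mix. At inference
# the optional `aggressiveness` dict sharpens the mask by raising it to a
# power above and below `split_bin`, and predict() trims `offset` frames from
# both ends of the time axis.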
40 | 41 | class CascadedASPPNet(nn.Module): 42 | 43 | def __init__(self, n_fft): 44 | super(CascadedASPPNet, self).__init__() 45 | self.stg1_low_band_net = BaseASPPNet(2, 16) 46 | self.stg1_high_band_net = BaseASPPNet(2, 16) 47 | 48 | self.stg2_bridge = layers.Conv2DBNActiv(18, 8, 1, 1, 0) 49 | self.stg2_full_band_net = BaseASPPNet(8, 16) 50 | 51 | self.stg3_bridge = layers.Conv2DBNActiv(34, 16, 1, 1, 0) 52 | self.stg3_full_band_net = BaseASPPNet(16, 32) 53 | 54 | self.out = nn.Conv2d(32, 2, 1, bias=False) 55 | self.aux1_out = nn.Conv2d(16, 2, 1, bias=False) 56 | self.aux2_out = nn.Conv2d(16, 2, 1, bias=False) 57 | 58 | self.max_bin = n_fft // 2 59 | self.output_bin = n_fft // 2 + 1 60 | 61 | self.offset = 128 62 | 63 | def forward(self, x, aggressiveness=None): 64 | mix = x.detach() 65 | x = x.clone() 66 | 67 | x = x[:, :, :self.max_bin] 68 | 69 | bandw = x.size()[2] // 2 70 | aux1 = torch.cat([ 71 | self.stg1_low_band_net(x[:, :, :bandw]), 72 | self.stg1_high_band_net(x[:, :, bandw:]) 73 | ], dim=2) 74 | 75 | h = torch.cat([x, aux1], dim=1) 76 | aux2 = self.stg2_full_band_net(self.stg2_bridge(h)) 77 | 78 | h = torch.cat([x, aux1, aux2], dim=1) 79 | h = self.stg3_full_band_net(self.stg3_bridge(h)) 80 | 81 | mask = torch.sigmoid(self.out(h)) 82 | mask = F.pad( 83 | input=mask, 84 | pad=(0, 0, 0, self.output_bin - mask.size()[2]), 85 | mode='replicate') 86 | 87 | if self.training: 88 | aux1 = torch.sigmoid(self.aux1_out(aux1)) 89 | aux1 = F.pad( 90 | input=aux1, 91 | pad=(0, 0, 0, self.output_bin - aux1.size()[2]), 92 | mode='replicate') 93 | aux2 = torch.sigmoid(self.aux2_out(aux2)) 94 | aux2 = F.pad( 95 | input=aux2, 96 | pad=(0, 0, 0, self.output_bin - aux2.size()[2]), 97 | mode='replicate') 98 | return mask * mix, aux1 * mix, aux2 * mix 99 | else: 100 | if aggressiveness: 101 | mask[:, :, :aggressiveness['split_bin']] = torch.pow(mask[:, :, :aggressiveness['split_bin']], 1 + aggressiveness['value'] / 3) 102 | mask[:, :, aggressiveness['split_bin']:] = torch.pow(mask[:, :, aggressiveness['split_bin']:], 1 + aggressiveness['value']) 103 | 104 | return mask * mix 105 | 106 | def predict(self, x_mag, aggressiveness=None): 107 | h = self.forward(x_mag, aggressiveness) 108 | 109 | if self.offset > 0: 110 | h = h[:, :, :, self.offset:-self.offset] 111 | assert h.size()[3] > 0 112 | 113 | return h 114 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/nets_123812KB.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | import torch.nn.functional as F 4 | 5 | from uvr5_pack.lib_v5 import layers_123821KB as layers 6 | 7 | 8 | class BaseASPPNet(nn.Module): 9 | 10 | def __init__(self, nin, ch, dilations=(4, 8, 16)): 11 | super(BaseASPPNet, self).__init__() 12 | self.enc1 = layers.Encoder(nin, ch, 3, 2, 1) 13 | self.enc2 = layers.Encoder(ch, ch * 2, 3, 2, 1) 14 | self.enc3 = layers.Encoder(ch * 2, ch * 4, 3, 2, 1) 15 | self.enc4 = layers.Encoder(ch * 4, ch * 8, 3, 2, 1) 16 | 17 | self.aspp = layers.ASPPModule(ch * 8, ch * 16, dilations) 18 | 19 | self.dec4 = layers.Decoder(ch * (8 + 16), ch * 8, 3, 1, 1) 20 | self.dec3 = layers.Decoder(ch * (4 + 8), ch * 4, 3, 1, 1) 21 | self.dec2 = layers.Decoder(ch * (2 + 4), ch * 2, 3, 1, 1) 22 | self.dec1 = layers.Decoder(ch * (1 + 2), ch, 3, 1, 1) 23 | 24 | def __call__(self, x): 25 | h, e1 = self.enc1(x) 26 | h, e2 = self.enc2(h) 27 | h, e3 = self.enc3(h) 28 | h, e4 = self.enc4(h) 29 | 30 | h = self.aspp(h) 31 | 32 | h = self.dec4(h, 
e4) 33 | h = self.dec3(h, e3) 34 | h = self.dec2(h, e2) 35 | h = self.dec1(h, e1) 36 | 37 | return h 38 | 39 | 40 | class CascadedASPPNet(nn.Module): 41 | 42 | def __init__(self, n_fft): 43 | super(CascadedASPPNet, self).__init__() 44 | self.stg1_low_band_net = BaseASPPNet(2, 32) 45 | self.stg1_high_band_net = BaseASPPNet(2, 32) 46 | 47 | self.stg2_bridge = layers.Conv2DBNActiv(34, 16, 1, 1, 0) 48 | self.stg2_full_band_net = BaseASPPNet(16, 32) 49 | 50 | self.stg3_bridge = layers.Conv2DBNActiv(66, 32, 1, 1, 0) 51 | self.stg3_full_band_net = BaseASPPNet(32, 64) 52 | 53 | self.out = nn.Conv2d(64, 2, 1, bias=False) 54 | self.aux1_out = nn.Conv2d(32, 2, 1, bias=False) 55 | self.aux2_out = nn.Conv2d(32, 2, 1, bias=False) 56 | 57 | self.max_bin = n_fft // 2 58 | self.output_bin = n_fft // 2 + 1 59 | 60 | self.offset = 128 61 | 62 | def forward(self, x, aggressiveness=None): 63 | mix = x.detach() 64 | x = x.clone() 65 | 66 | x = x[:, :, :self.max_bin] 67 | 68 | bandw = x.size()[2] // 2 69 | aux1 = torch.cat([ 70 | self.stg1_low_band_net(x[:, :, :bandw]), 71 | self.stg1_high_band_net(x[:, :, bandw:]) 72 | ], dim=2) 73 | 74 | h = torch.cat([x, aux1], dim=1) 75 | aux2 = self.stg2_full_band_net(self.stg2_bridge(h)) 76 | 77 | h = torch.cat([x, aux1, aux2], dim=1) 78 | h = self.stg3_full_band_net(self.stg3_bridge(h)) 79 | 80 | mask = torch.sigmoid(self.out(h)) 81 | mask = F.pad( 82 | input=mask, 83 | pad=(0, 0, 0, self.output_bin - mask.size()[2]), 84 | mode='replicate') 85 | 86 | if self.training: 87 | aux1 = torch.sigmoid(self.aux1_out(aux1)) 88 | aux1 = F.pad( 89 | input=aux1, 90 | pad=(0, 0, 0, self.output_bin - aux1.size()[2]), 91 | mode='replicate') 92 | aux2 = torch.sigmoid(self.aux2_out(aux2)) 93 | aux2 = F.pad( 94 | input=aux2, 95 | pad=(0, 0, 0, self.output_bin - aux2.size()[2]), 96 | mode='replicate') 97 | return mask * mix, aux1 * mix, aux2 * mix 98 | else: 99 | if aggressiveness: 100 | mask[:, :, :aggressiveness['split_bin']] = torch.pow(mask[:, :, :aggressiveness['split_bin']], 1 + aggressiveness['value'] / 3) 101 | mask[:, :, aggressiveness['split_bin']:] = torch.pow(mask[:, :, aggressiveness['split_bin']:], 1 + aggressiveness['value']) 102 | 103 | return mask * mix 104 | 105 | def predict(self, x_mag, aggressiveness=None): 106 | h = self.forward(x_mag, aggressiveness) 107 | 108 | if self.offset > 0: 109 | h = h[:, :, :, self.offset:-self.offset] 110 | assert h.size()[3] > 0 111 | 112 | return h 113 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/nets_123821KB.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | import torch.nn.functional as F 4 | 5 | from uvr5_pack.lib_v5 import layers_123821KB as layers 6 | 7 | 8 | class BaseASPPNet(nn.Module): 9 | 10 | def __init__(self, nin, ch, dilations=(4, 8, 16)): 11 | super(BaseASPPNet, self).__init__() 12 | self.enc1 = layers.Encoder(nin, ch, 3, 2, 1) 13 | self.enc2 = layers.Encoder(ch, ch * 2, 3, 2, 1) 14 | self.enc3 = layers.Encoder(ch * 2, ch * 4, 3, 2, 1) 15 | self.enc4 = layers.Encoder(ch * 4, ch * 8, 3, 2, 1) 16 | 17 | self.aspp = layers.ASPPModule(ch * 8, ch * 16, dilations) 18 | 19 | self.dec4 = layers.Decoder(ch * (8 + 16), ch * 8, 3, 1, 1) 20 | self.dec3 = layers.Decoder(ch * (4 + 8), ch * 4, 3, 1, 1) 21 | self.dec2 = layers.Decoder(ch * (2 + 4), ch * 2, 3, 1, 1) 22 | self.dec1 = layers.Decoder(ch * (1 + 2), ch, 3, 1, 1) 23 | 24 | def __call__(self, x): 25 | h, e1 = self.enc1(x) 26 | h, e2 = self.enc2(h) 
27 | h, e3 = self.enc3(h) 28 | h, e4 = self.enc4(h) 29 | 30 | h = self.aspp(h) 31 | 32 | h = self.dec4(h, e4) 33 | h = self.dec3(h, e3) 34 | h = self.dec2(h, e2) 35 | h = self.dec1(h, e1) 36 | 37 | return h 38 | 39 | 40 | class CascadedASPPNet(nn.Module): 41 | 42 | def __init__(self, n_fft): 43 | super(CascadedASPPNet, self).__init__() 44 | self.stg1_low_band_net = BaseASPPNet(2, 32) 45 | self.stg1_high_band_net = BaseASPPNet(2, 32) 46 | 47 | self.stg2_bridge = layers.Conv2DBNActiv(34, 16, 1, 1, 0) 48 | self.stg2_full_band_net = BaseASPPNet(16, 32) 49 | 50 | self.stg3_bridge = layers.Conv2DBNActiv(66, 32, 1, 1, 0) 51 | self.stg3_full_band_net = BaseASPPNet(32, 64) 52 | 53 | self.out = nn.Conv2d(64, 2, 1, bias=False) 54 | self.aux1_out = nn.Conv2d(32, 2, 1, bias=False) 55 | self.aux2_out = nn.Conv2d(32, 2, 1, bias=False) 56 | 57 | self.max_bin = n_fft // 2 58 | self.output_bin = n_fft // 2 + 1 59 | 60 | self.offset = 128 61 | 62 | def forward(self, x, aggressiveness=None): 63 | mix = x.detach() 64 | x = x.clone() 65 | 66 | x = x[:, :, :self.max_bin] 67 | 68 | bandw = x.size()[2] // 2 69 | aux1 = torch.cat([ 70 | self.stg1_low_band_net(x[:, :, :bandw]), 71 | self.stg1_high_band_net(x[:, :, bandw:]) 72 | ], dim=2) 73 | 74 | h = torch.cat([x, aux1], dim=1) 75 | aux2 = self.stg2_full_band_net(self.stg2_bridge(h)) 76 | 77 | h = torch.cat([x, aux1, aux2], dim=1) 78 | h = self.stg3_full_band_net(self.stg3_bridge(h)) 79 | 80 | mask = torch.sigmoid(self.out(h)) 81 | mask = F.pad( 82 | input=mask, 83 | pad=(0, 0, 0, self.output_bin - mask.size()[2]), 84 | mode='replicate') 85 | 86 | if self.training: 87 | aux1 = torch.sigmoid(self.aux1_out(aux1)) 88 | aux1 = F.pad( 89 | input=aux1, 90 | pad=(0, 0, 0, self.output_bin - aux1.size()[2]), 91 | mode='replicate') 92 | aux2 = torch.sigmoid(self.aux2_out(aux2)) 93 | aux2 = F.pad( 94 | input=aux2, 95 | pad=(0, 0, 0, self.output_bin - aux2.size()[2]), 96 | mode='replicate') 97 | return mask * mix, aux1 * mix, aux2 * mix 98 | else: 99 | if aggressiveness: 100 | mask[:, :, :aggressiveness['split_bin']] = torch.pow(mask[:, :, :aggressiveness['split_bin']], 1 + aggressiveness['value'] / 3) 101 | mask[:, :, aggressiveness['split_bin']:] = torch.pow(mask[:, :, aggressiveness['split_bin']:], 1 + aggressiveness['value']) 102 | 103 | return mask * mix 104 | 105 | def predict(self, x_mag, aggressiveness=None): 106 | h = self.forward(x_mag, aggressiveness) 107 | 108 | if self.offset > 0: 109 | h = h[:, :, :, self.offset:-self.offset] 110 | assert h.size()[3] > 0 111 | 112 | return h 113 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/nets_33966KB.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | import torch.nn.functional as F 4 | 5 | from uvr5_pack.lib_v5 import layers_33966KB as layers 6 | 7 | 8 | class BaseASPPNet(nn.Module): 9 | 10 | def __init__(self, nin, ch, dilations=(4, 8, 16, 32)): 11 | super(BaseASPPNet, self).__init__() 12 | self.enc1 = layers.Encoder(nin, ch, 3, 2, 1) 13 | self.enc2 = layers.Encoder(ch, ch * 2, 3, 2, 1) 14 | self.enc3 = layers.Encoder(ch * 2, ch * 4, 3, 2, 1) 15 | self.enc4 = layers.Encoder(ch * 4, ch * 8, 3, 2, 1) 16 | 17 | self.aspp = layers.ASPPModule(ch * 8, ch * 16, dilations) 18 | 19 | self.dec4 = layers.Decoder(ch * (8 + 16), ch * 8, 3, 1, 1) 20 | self.dec3 = layers.Decoder(ch * (4 + 8), ch * 4, 3, 1, 1) 21 | self.dec2 = layers.Decoder(ch * (2 + 4), ch * 2, 3, 1, 1) 22 | self.dec1 = layers.Decoder(ch * 
(1 + 2), ch, 3, 1, 1) 23 | 24 | def __call__(self, x): 25 | h, e1 = self.enc1(x) 26 | h, e2 = self.enc2(h) 27 | h, e3 = self.enc3(h) 28 | h, e4 = self.enc4(h) 29 | 30 | h = self.aspp(h) 31 | 32 | h = self.dec4(h, e4) 33 | h = self.dec3(h, e3) 34 | h = self.dec2(h, e2) 35 | h = self.dec1(h, e1) 36 | 37 | return h 38 | 39 | 40 | class CascadedASPPNet(nn.Module): 41 | 42 | def __init__(self, n_fft): 43 | super(CascadedASPPNet, self).__init__() 44 | self.stg1_low_band_net = BaseASPPNet(2, 16) 45 | self.stg1_high_band_net = BaseASPPNet(2, 16) 46 | 47 | self.stg2_bridge = layers.Conv2DBNActiv(18, 8, 1, 1, 0) 48 | self.stg2_full_band_net = BaseASPPNet(8, 16) 49 | 50 | self.stg3_bridge = layers.Conv2DBNActiv(34, 16, 1, 1, 0) 51 | self.stg3_full_band_net = BaseASPPNet(16, 32) 52 | 53 | self.out = nn.Conv2d(32, 2, 1, bias=False) 54 | self.aux1_out = nn.Conv2d(16, 2, 1, bias=False) 55 | self.aux2_out = nn.Conv2d(16, 2, 1, bias=False) 56 | 57 | self.max_bin = n_fft // 2 58 | self.output_bin = n_fft // 2 + 1 59 | 60 | self.offset = 128 61 | 62 | def forward(self, x, aggressiveness=None): 63 | mix = x.detach() 64 | x = x.clone() 65 | 66 | x = x[:, :, :self.max_bin] 67 | 68 | bandw = x.size()[2] // 2 69 | aux1 = torch.cat([ 70 | self.stg1_low_band_net(x[:, :, :bandw]), 71 | self.stg1_high_band_net(x[:, :, bandw:]) 72 | ], dim=2) 73 | 74 | h = torch.cat([x, aux1], dim=1) 75 | aux2 = self.stg2_full_band_net(self.stg2_bridge(h)) 76 | 77 | h = torch.cat([x, aux1, aux2], dim=1) 78 | h = self.stg3_full_band_net(self.stg3_bridge(h)) 79 | 80 | mask = torch.sigmoid(self.out(h)) 81 | mask = F.pad( 82 | input=mask, 83 | pad=(0, 0, 0, self.output_bin - mask.size()[2]), 84 | mode='replicate') 85 | 86 | if self.training: 87 | aux1 = torch.sigmoid(self.aux1_out(aux1)) 88 | aux1 = F.pad( 89 | input=aux1, 90 | pad=(0, 0, 0, self.output_bin - aux1.size()[2]), 91 | mode='replicate') 92 | aux2 = torch.sigmoid(self.aux2_out(aux2)) 93 | aux2 = F.pad( 94 | input=aux2, 95 | pad=(0, 0, 0, self.output_bin - aux2.size()[2]), 96 | mode='replicate') 97 | return mask * mix, aux1 * mix, aux2 * mix 98 | else: 99 | if aggressiveness: 100 | mask[:, :, :aggressiveness['split_bin']] = torch.pow(mask[:, :, :aggressiveness['split_bin']], 1 + aggressiveness['value'] / 3) 101 | mask[:, :, aggressiveness['split_bin']:] = torch.pow(mask[:, :, aggressiveness['split_bin']:], 1 + aggressiveness['value']) 102 | 103 | return mask * mix 104 | 105 | def predict(self, x_mag, aggressiveness=None): 106 | h = self.forward(x_mag, aggressiveness) 107 | 108 | if self.offset > 0: 109 | h = h[:, :, :, self.offset:-self.offset] 110 | assert h.size()[3] > 0 111 | 112 | return h 113 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/nets_537227KB.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import numpy as np 3 | from torch import nn 4 | import torch.nn.functional as F 5 | 6 | from uvr5_pack.lib_v5 import layers_537238KB as layers 7 | 8 | 9 | class BaseASPPNet(nn.Module): 10 | 11 | def __init__(self, nin, ch, dilations=(4, 8, 16)): 12 | super(BaseASPPNet, self).__init__() 13 | self.enc1 = layers.Encoder(nin, ch, 3, 2, 1) 14 | self.enc2 = layers.Encoder(ch, ch * 2, 3, 2, 1) 15 | self.enc3 = layers.Encoder(ch * 2, ch * 4, 3, 2, 1) 16 | self.enc4 = layers.Encoder(ch * 4, ch * 8, 3, 2, 1) 17 | 18 | self.aspp = layers.ASPPModule(ch * 8, ch * 16, dilations) 19 | 20 | self.dec4 = layers.Decoder(ch * (8 + 16), ch * 8, 3, 1, 1) 21 | self.dec3 = layers.Decoder(ch * 
(4 + 8), ch * 4, 3, 1, 1) 22 | self.dec2 = layers.Decoder(ch * (2 + 4), ch * 2, 3, 1, 1) 23 | self.dec1 = layers.Decoder(ch * (1 + 2), ch, 3, 1, 1) 24 | 25 | def __call__(self, x): 26 | h, e1 = self.enc1(x) 27 | h, e2 = self.enc2(h) 28 | h, e3 = self.enc3(h) 29 | h, e4 = self.enc4(h) 30 | 31 | h = self.aspp(h) 32 | 33 | h = self.dec4(h, e4) 34 | h = self.dec3(h, e3) 35 | h = self.dec2(h, e2) 36 | h = self.dec1(h, e1) 37 | 38 | return h 39 | 40 | 41 | class CascadedASPPNet(nn.Module): 42 | 43 | def __init__(self, n_fft): 44 | super(CascadedASPPNet, self).__init__() 45 | self.stg1_low_band_net = BaseASPPNet(2, 64) 46 | self.stg1_high_band_net = BaseASPPNet(2, 64) 47 | 48 | self.stg2_bridge = layers.Conv2DBNActiv(66, 32, 1, 1, 0) 49 | self.stg2_full_band_net = BaseASPPNet(32, 64) 50 | 51 | self.stg3_bridge = layers.Conv2DBNActiv(130, 64, 1, 1, 0) 52 | self.stg3_full_band_net = BaseASPPNet(64, 128) 53 | 54 | self.out = nn.Conv2d(128, 2, 1, bias=False) 55 | self.aux1_out = nn.Conv2d(64, 2, 1, bias=False) 56 | self.aux2_out = nn.Conv2d(64, 2, 1, bias=False) 57 | 58 | self.max_bin = n_fft // 2 59 | self.output_bin = n_fft // 2 + 1 60 | 61 | self.offset = 128 62 | 63 | def forward(self, x, aggressiveness=None): 64 | mix = x.detach() 65 | x = x.clone() 66 | 67 | x = x[:, :, :self.max_bin] 68 | 69 | bandw = x.size()[2] // 2 70 | aux1 = torch.cat([ 71 | self.stg1_low_band_net(x[:, :, :bandw]), 72 | self.stg1_high_band_net(x[:, :, bandw:]) 73 | ], dim=2) 74 | 75 | h = torch.cat([x, aux1], dim=1) 76 | aux2 = self.stg2_full_band_net(self.stg2_bridge(h)) 77 | 78 | h = torch.cat([x, aux1, aux2], dim=1) 79 | h = self.stg3_full_band_net(self.stg3_bridge(h)) 80 | 81 | mask = torch.sigmoid(self.out(h)) 82 | mask = F.pad( 83 | input=mask, 84 | pad=(0, 0, 0, self.output_bin - mask.size()[2]), 85 | mode='replicate') 86 | 87 | if self.training: 88 | aux1 = torch.sigmoid(self.aux1_out(aux1)) 89 | aux1 = F.pad( 90 | input=aux1, 91 | pad=(0, 0, 0, self.output_bin - aux1.size()[2]), 92 | mode='replicate') 93 | aux2 = torch.sigmoid(self.aux2_out(aux2)) 94 | aux2 = F.pad( 95 | input=aux2, 96 | pad=(0, 0, 0, self.output_bin - aux2.size()[2]), 97 | mode='replicate') 98 | return mask * mix, aux1 * mix, aux2 * mix 99 | else: 100 | if aggressiveness: 101 | mask[:, :, :aggressiveness['split_bin']] = torch.pow(mask[:, :, :aggressiveness['split_bin']], 1 + aggressiveness['value'] / 3) 102 | mask[:, :, aggressiveness['split_bin']:] = torch.pow(mask[:, :, aggressiveness['split_bin']:], 1 + aggressiveness['value']) 103 | 104 | return mask * mix 105 | 106 | def predict(self, x_mag, aggressiveness=None): 107 | h = self.forward(x_mag, aggressiveness) 108 | 109 | if self.offset > 0: 110 | h = h[:, :, :, self.offset:-self.offset] 111 | assert h.size()[3] > 0 112 | 113 | return h 114 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/nets_537238KB.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import numpy as np 3 | from torch import nn 4 | import torch.nn.functional as F 5 | 6 | from uvr5_pack.lib_v5 import layers_537238KB as layers 7 | 8 | 9 | class BaseASPPNet(nn.Module): 10 | 11 | def __init__(self, nin, ch, dilations=(4, 8, 16)): 12 | super(BaseASPPNet, self).__init__() 13 | self.enc1 = layers.Encoder(nin, ch, 3, 2, 1) 14 | self.enc2 = layers.Encoder(ch, ch * 2, 3, 2, 1) 15 | self.enc3 = layers.Encoder(ch * 2, ch * 4, 3, 2, 1) 16 | self.enc4 = layers.Encoder(ch * 4, ch * 8, 3, 2, 1) 17 | 18 | self.aspp = 
layers.ASPPModule(ch * 8, ch * 16, dilations) 19 | 20 | self.dec4 = layers.Decoder(ch * (8 + 16), ch * 8, 3, 1, 1) 21 | self.dec3 = layers.Decoder(ch * (4 + 8), ch * 4, 3, 1, 1) 22 | self.dec2 = layers.Decoder(ch * (2 + 4), ch * 2, 3, 1, 1) 23 | self.dec1 = layers.Decoder(ch * (1 + 2), ch, 3, 1, 1) 24 | 25 | def __call__(self, x): 26 | h, e1 = self.enc1(x) 27 | h, e2 = self.enc2(h) 28 | h, e3 = self.enc3(h) 29 | h, e4 = self.enc4(h) 30 | 31 | h = self.aspp(h) 32 | 33 | h = self.dec4(h, e4) 34 | h = self.dec3(h, e3) 35 | h = self.dec2(h, e2) 36 | h = self.dec1(h, e1) 37 | 38 | return h 39 | 40 | 41 | class CascadedASPPNet(nn.Module): 42 | 43 | def __init__(self, n_fft): 44 | super(CascadedASPPNet, self).__init__() 45 | self.stg1_low_band_net = BaseASPPNet(2, 64) 46 | self.stg1_high_band_net = BaseASPPNet(2, 64) 47 | 48 | self.stg2_bridge = layers.Conv2DBNActiv(66, 32, 1, 1, 0) 49 | self.stg2_full_band_net = BaseASPPNet(32, 64) 50 | 51 | self.stg3_bridge = layers.Conv2DBNActiv(130, 64, 1, 1, 0) 52 | self.stg3_full_band_net = BaseASPPNet(64, 128) 53 | 54 | self.out = nn.Conv2d(128, 2, 1, bias=False) 55 | self.aux1_out = nn.Conv2d(64, 2, 1, bias=False) 56 | self.aux2_out = nn.Conv2d(64, 2, 1, bias=False) 57 | 58 | self.max_bin = n_fft // 2 59 | self.output_bin = n_fft // 2 + 1 60 | 61 | self.offset = 128 62 | 63 | def forward(self, x, aggressiveness=None): 64 | mix = x.detach() 65 | x = x.clone() 66 | 67 | x = x[:, :, :self.max_bin] 68 | 69 | bandw = x.size()[2] // 2 70 | aux1 = torch.cat([ 71 | self.stg1_low_band_net(x[:, :, :bandw]), 72 | self.stg1_high_band_net(x[:, :, bandw:]) 73 | ], dim=2) 74 | 75 | h = torch.cat([x, aux1], dim=1) 76 | aux2 = self.stg2_full_band_net(self.stg2_bridge(h)) 77 | 78 | h = torch.cat([x, aux1, aux2], dim=1) 79 | h = self.stg3_full_band_net(self.stg3_bridge(h)) 80 | 81 | mask = torch.sigmoid(self.out(h)) 82 | mask = F.pad( 83 | input=mask, 84 | pad=(0, 0, 0, self.output_bin - mask.size()[2]), 85 | mode='replicate') 86 | 87 | if self.training: 88 | aux1 = torch.sigmoid(self.aux1_out(aux1)) 89 | aux1 = F.pad( 90 | input=aux1, 91 | pad=(0, 0, 0, self.output_bin - aux1.size()[2]), 92 | mode='replicate') 93 | aux2 = torch.sigmoid(self.aux2_out(aux2)) 94 | aux2 = F.pad( 95 | input=aux2, 96 | pad=(0, 0, 0, self.output_bin - aux2.size()[2]), 97 | mode='replicate') 98 | return mask * mix, aux1 * mix, aux2 * mix 99 | else: 100 | if aggressiveness: 101 | mask[:, :, :aggressiveness['split_bin']] = torch.pow(mask[:, :, :aggressiveness['split_bin']], 1 + aggressiveness['value'] / 3) 102 | mask[:, :, aggressiveness['split_bin']:] = torch.pow(mask[:, :, aggressiveness['split_bin']:], 1 + aggressiveness['value']) 103 | 104 | return mask * mix 105 | 106 | def predict(self, x_mag, aggressiveness=None): 107 | h = self.forward(x_mag, aggressiveness) 108 | 109 | if self.offset > 0: 110 | h = h[:, :, :, self.offset:-self.offset] 111 | assert h.size()[3] > 0 112 | 113 | return h 114 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/nets_61968KB.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | import torch.nn.functional as F 4 | 5 | from uvr5_pack.lib_v5 import layers_123821KB as layers 6 | 7 | 8 | class BaseASPPNet(nn.Module): 9 | 10 | def __init__(self, nin, ch, dilations=(4, 8, 16)): 11 | super(BaseASPPNet, self).__init__() 12 | self.enc1 = layers.Encoder(nin, ch, 3, 2, 1) 13 | self.enc2 = layers.Encoder(ch, ch * 2, 3, 2, 1) 14 | self.enc3 = 
layers.Encoder(ch * 2, ch * 4, 3, 2, 1) 15 | self.enc4 = layers.Encoder(ch * 4, ch * 8, 3, 2, 1) 16 | 17 | self.aspp = layers.ASPPModule(ch * 8, ch * 16, dilations) 18 | 19 | self.dec4 = layers.Decoder(ch * (8 + 16), ch * 8, 3, 1, 1) 20 | self.dec3 = layers.Decoder(ch * (4 + 8), ch * 4, 3, 1, 1) 21 | self.dec2 = layers.Decoder(ch * (2 + 4), ch * 2, 3, 1, 1) 22 | self.dec1 = layers.Decoder(ch * (1 + 2), ch, 3, 1, 1) 23 | 24 | def __call__(self, x): 25 | h, e1 = self.enc1(x) 26 | h, e2 = self.enc2(h) 27 | h, e3 = self.enc3(h) 28 | h, e4 = self.enc4(h) 29 | 30 | h = self.aspp(h) 31 | 32 | h = self.dec4(h, e4) 33 | h = self.dec3(h, e3) 34 | h = self.dec2(h, e2) 35 | h = self.dec1(h, e1) 36 | 37 | return h 38 | 39 | 40 | class CascadedASPPNet(nn.Module): 41 | 42 | def __init__(self, n_fft): 43 | super(CascadedASPPNet, self).__init__() 44 | self.stg1_low_band_net = BaseASPPNet(2, 32) 45 | self.stg1_high_band_net = BaseASPPNet(2, 32) 46 | 47 | self.stg2_bridge = layers.Conv2DBNActiv(34, 16, 1, 1, 0) 48 | self.stg2_full_band_net = BaseASPPNet(16, 32) 49 | 50 | self.stg3_bridge = layers.Conv2DBNActiv(66, 32, 1, 1, 0) 51 | self.stg3_full_band_net = BaseASPPNet(32, 64) 52 | 53 | self.out = nn.Conv2d(64, 2, 1, bias=False) 54 | self.aux1_out = nn.Conv2d(32, 2, 1, bias=False) 55 | self.aux2_out = nn.Conv2d(32, 2, 1, bias=False) 56 | 57 | self.max_bin = n_fft // 2 58 | self.output_bin = n_fft // 2 + 1 59 | 60 | self.offset = 128 61 | 62 | def forward(self, x, aggressiveness=None): 63 | mix = x.detach() 64 | x = x.clone() 65 | 66 | x = x[:, :, :self.max_bin] 67 | 68 | bandw = x.size()[2] // 2 69 | aux1 = torch.cat([ 70 | self.stg1_low_band_net(x[:, :, :bandw]), 71 | self.stg1_high_band_net(x[:, :, bandw:]) 72 | ], dim=2) 73 | 74 | h = torch.cat([x, aux1], dim=1) 75 | aux2 = self.stg2_full_band_net(self.stg2_bridge(h)) 76 | 77 | h = torch.cat([x, aux1, aux2], dim=1) 78 | h = self.stg3_full_band_net(self.stg3_bridge(h)) 79 | 80 | mask = torch.sigmoid(self.out(h)) 81 | mask = F.pad( 82 | input=mask, 83 | pad=(0, 0, 0, self.output_bin - mask.size()[2]), 84 | mode='replicate') 85 | 86 | if self.training: 87 | aux1 = torch.sigmoid(self.aux1_out(aux1)) 88 | aux1 = F.pad( 89 | input=aux1, 90 | pad=(0, 0, 0, self.output_bin - aux1.size()[2]), 91 | mode='replicate') 92 | aux2 = torch.sigmoid(self.aux2_out(aux2)) 93 | aux2 = F.pad( 94 | input=aux2, 95 | pad=(0, 0, 0, self.output_bin - aux2.size()[2]), 96 | mode='replicate') 97 | return mask * mix, aux1 * mix, aux2 * mix 98 | else: 99 | if aggressiveness: 100 | mask[:, :, :aggressiveness['split_bin']] = torch.pow(mask[:, :, :aggressiveness['split_bin']], 1 + aggressiveness['value'] / 3) 101 | mask[:, :, aggressiveness['split_bin']:] = torch.pow(mask[:, :, aggressiveness['split_bin']:], 1 + aggressiveness['value']) 102 | 103 | return mask * mix 104 | 105 | def predict(self, x_mag, aggressiveness=None): 106 | h = self.forward(x_mag, aggressiveness) 107 | 108 | if self.offset > 0: 109 | h = h[:, :, :, self.offset:-self.offset] 110 | assert h.size()[3] > 0 111 | 112 | return h 113 | -------------------------------------------------------------------------------- /uvr5_pack/utils.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import numpy as np 3 | from tqdm import tqdm 4 | 5 | def make_padding(width, cropsize, offset): 6 | left = offset 7 | roi_size = cropsize - left * 2 8 | if roi_size == 0: 9 | roi_size = cropsize 10 | right = roi_size - (width % roi_size) + left 11 | 12 | return left, right, roi_size 13 | def 
inference(X_spec, device, model, aggressiveness,data): 14 | ''' 15 | data : dic configs 16 | ''' 17 | 18 | def _execute(X_mag_pad, roi_size, n_window, device, model, aggressiveness,is_half=True): 19 | model.eval() 20 | with torch.no_grad(): 21 | preds = [] 22 | 23 | iterations = [n_window] 24 | 25 | total_iterations = sum(iterations) 26 | for i in tqdm(range(n_window)): 27 | start = i * roi_size 28 | X_mag_window = X_mag_pad[None, :, :, start:start + data['window_size']] 29 | X_mag_window = torch.from_numpy(X_mag_window) 30 | if(is_half==True):X_mag_window=X_mag_window.half() 31 | X_mag_window=X_mag_window.to(device) 32 | 33 | pred = model.predict(X_mag_window, aggressiveness) 34 | 35 | pred = pred.detach().cpu().numpy() 36 | preds.append(pred[0]) 37 | 38 | pred = np.concatenate(preds, axis=2) 39 | return pred 40 | 41 | def preprocess(X_spec): 42 | X_mag = np.abs(X_spec) 43 | X_phase = np.angle(X_spec) 44 | 45 | return X_mag, X_phase 46 | 47 | X_mag, X_phase = preprocess(X_spec) 48 | 49 | coef = X_mag.max() 50 | X_mag_pre = X_mag / coef 51 | 52 | n_frame = X_mag_pre.shape[2] 53 | pad_l, pad_r, roi_size = make_padding(n_frame, 54 | data['window_size'], model.offset) 55 | n_window = int(np.ceil(n_frame / roi_size)) 56 | 57 | X_mag_pad = np.pad( 58 | X_mag_pre, ((0, 0), (0, 0), (pad_l, pad_r)), mode='constant') 59 | 60 | if(list(model.state_dict().values())[0].dtype==torch.float16):is_half=True 61 | else:is_half=False 62 | pred = _execute(X_mag_pad, roi_size, n_window, 63 | device, model, aggressiveness,is_half) 64 | pred = pred[:, :, :n_frame] 65 | 66 | if data['tta']: 67 | pad_l += roi_size // 2 68 | pad_r += roi_size // 2 69 | n_window += 1 70 | 71 | X_mag_pad = np.pad( 72 | X_mag_pre, ((0, 0), (0, 0), (pad_l, pad_r)), mode='constant') 73 | 74 | pred_tta = _execute(X_mag_pad, roi_size, n_window, 75 | device, model, aggressiveness,is_half) 76 | pred_tta = pred_tta[:, :, roi_size // 2:] 77 | pred_tta = pred_tta[:, :, :n_frame] 78 | 79 | return (pred + pred_tta) * 0.5 * coef, X_mag, np.exp(1.j * X_phase) 80 | else: 81 | return pred * coef, X_mag, np.exp(1.j * X_phase) 82 | 83 | 84 | 85 | def _get_name_params(model_path , model_hash): 86 | ModelName = model_path 87 | if model_hash == '47939caf0cfe52a0e81442b85b971dfd': 88 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100.json') 89 | param_name_auto=str('4band_44100') 90 | if model_hash == '4e4ecb9764c50a8c414fee6e10395bbe': 91 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_v2.json') 92 | param_name_auto=str('4band_v2') 93 | if model_hash == 'ca106edd563e034bde0bdec4bb7a4b36': 94 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_v2.json') 95 | param_name_auto=str('4band_v2') 96 | if model_hash == 'e60a1e84803ce4efc0a6551206cc4b71': 97 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100.json') 98 | param_name_auto=str('4band_44100') 99 | if model_hash == 'a82f14e75892e55e994376edbf0c8435': 100 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100.json') 101 | param_name_auto=str('4band_44100') 102 | if model_hash == '6dd9eaa6f0420af9f1d403aaafa4cc06': 103 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_v2_sn.json') 104 | param_name_auto=str('4band_v2_sn') 105 | if model_hash == '08611fb99bd59eaa79ad27c58d137727': 106 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_v2_sn.json') 107 | param_name_auto=str('4band_v2_sn') 108 | if model_hash == '5c7bbca45a187e81abbbd351606164e5': 109 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/3band_44100_msb2.json') 
110 | param_name_auto=str('3band_44100_msb2') 111 | if model_hash == 'd6b2cb685a058a091e5e7098192d3233': 112 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/3band_44100_msb2.json') 113 | param_name_auto=str('3band_44100_msb2') 114 | if model_hash == 'c1b9f38170a7c90e96f027992eb7c62b': 115 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100.json') 116 | param_name_auto=str('4band_44100') 117 | if model_hash == 'c3448ec923fa0edf3d03a19e633faa53': 118 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100.json') 119 | param_name_auto=str('4band_44100') 120 | if model_hash == '68aa2c8093d0080704b200d140f59e54': 121 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/3band_44100.json') 122 | param_name_auto=str('3band_44100.json') 123 | if model_hash == 'fdc83be5b798e4bd29fe00fe6600e147': 124 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/3band_44100_mid.json') 125 | param_name_auto=str('3band_44100_mid.json') 126 | if model_hash == '2ce34bc92fd57f55db16b7a4def3d745': 127 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/3band_44100_mid.json') 128 | param_name_auto=str('3band_44100_mid.json') 129 | if model_hash == '52fdca89576f06cf4340b74a4730ee5f': 130 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100.json') 131 | param_name_auto=str('4band_44100.json') 132 | if model_hash == '41191165b05d38fc77f072fa9e8e8a30': 133 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100.json') 134 | param_name_auto=str('4band_44100.json') 135 | if model_hash == '89e83b511ad474592689e562d5b1f80e': 136 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/2band_32000.json') 137 | param_name_auto=str('2band_32000.json') 138 | if model_hash == '0b954da81d453b716b114d6d7c95177f': 139 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/2band_32000.json') 140 | param_name_auto=str('2band_32000.json') 141 | 142 | #v4 Models 143 | if model_hash == '6a00461c51c2920fd68937d4609ed6c8': 144 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr16000_hl512.json') 145 | param_name_auto=str('1band_sr16000_hl512') 146 | if model_hash == '0ab504864d20f1bd378fe9c81ef37140': 147 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr32000_hl512.json') 148 | param_name_auto=str('1band_sr32000_hl512') 149 | if model_hash == '7dd21065bf91c10f7fccb57d7d83b07f': 150 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr32000_hl512.json') 151 | param_name_auto=str('1band_sr32000_hl512') 152 | if model_hash == '80ab74d65e515caa3622728d2de07d23': 153 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr32000_hl512.json') 154 | param_name_auto=str('1band_sr32000_hl512') 155 | if model_hash == 'edc115e7fc523245062200c00caa847f': 156 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr33075_hl384.json') 157 | param_name_auto=str('1band_sr33075_hl384') 158 | if model_hash == '28063e9f6ab5b341c5f6d3c67f2045b7': 159 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr33075_hl384.json') 160 | param_name_auto=str('1band_sr33075_hl384') 161 | if model_hash == 'b58090534c52cbc3e9b5104bad666ef2': 162 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr44100_hl512.json') 163 | param_name_auto=str('1band_sr44100_hl512') 164 | if model_hash == '0cdab9947f1b0928705f518f3c78ea8f': 165 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr44100_hl512.json') 166 | param_name_auto=str('1band_sr44100_hl512') 167 | if model_hash == 'ae702fed0238afb5346db8356fe25f13': 168 | 
model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr44100_hl1024.json') 169 | param_name_auto=str('1band_sr44100_hl1024') 170 | #User Models 171 | 172 | #1 Band 173 | if '1band_sr16000_hl512' in ModelName: 174 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr16000_hl512.json') 175 | param_name_auto=str('1band_sr16000_hl512') 176 | if '1band_sr32000_hl512' in ModelName: 177 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr32000_hl512.json') 178 | param_name_auto=str('1band_sr32000_hl512') 179 | if '1band_sr33075_hl384' in ModelName: 180 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr33075_hl384.json') 181 | param_name_auto=str('1band_sr33075_hl384') 182 | if '1band_sr44100_hl256' in ModelName: 183 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr44100_hl256.json') 184 | param_name_auto=str('1band_sr44100_hl256') 185 | if '1band_sr44100_hl512' in ModelName: 186 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr44100_hl512.json') 187 | param_name_auto=str('1band_sr44100_hl512') 188 | if '1band_sr44100_hl1024' in ModelName: 189 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr44100_hl1024.json') 190 | param_name_auto=str('1band_sr44100_hl1024') 191 | 192 | #2 Band 193 | if '2band_44100_lofi' in ModelName: 194 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/2band_44100_lofi.json') 195 | param_name_auto=str('2band_44100_lofi') 196 | if '2band_32000' in ModelName: 197 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/2band_32000.json') 198 | param_name_auto=str('2band_32000') 199 | if '2band_48000' in ModelName: 200 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/2band_48000.json') 201 | param_name_auto=str('2band_48000') 202 | 203 | #3 Band 204 | if '3band_44100' in ModelName: 205 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/3band_44100.json') 206 | param_name_auto=str('3band_44100') 207 | if '3band_44100_mid' in ModelName: 208 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/3band_44100_mid.json') 209 | param_name_auto=str('3band_44100_mid') 210 | if '3band_44100_msb2' in ModelName: 211 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/3band_44100_msb2.json') 212 | param_name_auto=str('3band_44100_msb2') 213 | 214 | #4 Band 215 | if '4band_44100' in ModelName: 216 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100.json') 217 | param_name_auto=str('4band_44100') 218 | if '4band_44100_mid' in ModelName: 219 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100_mid.json') 220 | param_name_auto=str('4band_44100_mid') 221 | if '4band_44100_msb' in ModelName: 222 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100_msb.json') 223 | param_name_auto=str('4band_44100_msb') 224 | if '4band_44100_msb2' in ModelName: 225 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100_msb2.json') 226 | param_name_auto=str('4band_44100_msb2') 227 | if '4band_44100_reverse' in ModelName: 228 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100_reverse.json') 229 | param_name_auto=str('4band_44100_reverse') 230 | if '4band_44100_sw' in ModelName: 231 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100_sw.json') 232 | param_name_auto=str('4band_44100_sw') 233 | if '4band_v2' in ModelName: 234 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_v2.json') 235 | param_name_auto=str('4band_v2') 236 | if '4band_v2_sn' in ModelName: 237 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_v2_sn.json') 238 | 
        param_name_auto=str('4band_v2_sn')
239 |     if 'tmodelparam' in ModelName:
240 |         model_params_auto=str('uvr5_pack/lib_v5/modelparams/tmodelparam.json')
241 |         param_name_auto=str('User Model Param Set')
242 |     return param_name_auto , model_params_auto
243 |
--------------------------------------------------------------------------------
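The helpers above (make_padding, inference, _get_name_params) are what infer_uvr5.py builds on for vocal/instrumental separation: a band-split parameter set is resolved from the checkpoint's hash or filename, a CascadedASPPNet from one of the nets_*KB.py variants is constructed, and inference slides a window over the magnitude spectrogram and returns the masked result together with magnitude and phase. A minimal sketch of that wiring follows; the checkpoint name, the placeholder hash, the n_fft value, the random spectrogram and the inference settings are illustrative assumptions, not values taken from this repository.

import hashlib
import numpy as np
import torch

from uvr5_pack.lib_v5 import nets_61968KB as nets
from uvr5_pack.utils import _get_name_params, inference

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical checkpoint name; the "4band_v2" substring lets _get_name_params
# resolve a parameter set even without a recognized hash.
model_path = "uvr5_weights/HP2_4band_v2.pth"
model_hash = "0" * 32  # placeholder; infer_uvr5.py derives the real hash from the checkpoint bytes
param_name, params_json = _get_name_params(model_path, model_hash)
print(param_name, params_json)  # -> 4band_v2 uvr5_pack/lib_v5/modelparams/4band_v2.json

# n_fft = 2048 is an assumed value; the real value comes from the parameter JSON.
# A trained run would also call model.load_state_dict(torch.load(model_path, map_location="cpu")).
model = nets.CascadedASPPNet(2048).to(device).eval()

# Stand-in for the complex spectrogram of a stereo mix: (channels, n_fft // 2 + 1, frames).
X_spec = (np.random.randn(2, 1025, 1024) + 1j * np.random.randn(2, 1025, 1024)).astype(np.complex64)

data = {"window_size": 512, "tta": False}          # assumed inference settings
aggressiveness = {"value": 0.1, "split_bin": 128}  # assumed aggressiveness settings
pred, X_mag, phase = inference(X_spec, device, model, aggressiveness, data)
instrumental_spec = pred * phase  # recombine the masked magnitude with the original phase

In the actual WebUI flow the multi-band input spectrogram is assembled through spec_utils according to the parameter JSON rather than the single random array used here.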
/uvr5_weights/.gitignore:
--------------------------------------------------------------------------------
1 | *
2 | !.gitignore
3 |
--------------------------------------------------------------------------------
/vc_infer_pipeline.py:
--------------------------------------------------------------------------------
1 | import numpy as np,parselmouth,torch,pdb
2 | from time import time as ttime
3 | import torch.nn.functional as F
4 | from config import x_pad,x_query,x_center,x_max
5 | import scipy.signal as signal
6 | import pyworld,os,traceback,faiss
7 | class VC(object):
8 |     def __init__(self,tgt_sr,device,is_half):
9 |         self.sr=16000  # hubert input sampling rate
10 |         self.window=160  # samples per frame
11 |         self.t_pad=self.sr*x_pad  # padding time added before and after each clip
12 |         self.t_pad_tgt=tgt_sr*x_pad
13 |         self.t_pad2=self.t_pad*2
14 |         self.t_query=self.sr*x_query  # search range around each candidate cut point
15 |         self.t_center=self.sr*x_center  # spacing of candidate cut points
16 |         self.t_max=self.sr*x_max  # duration threshold below which no cut-point search is done
17 |         self.device=device
18 |         self.is_half=is_half
19 |
20 |     def get_f0(self,x, p_len,f0_up_key,f0_method,inp_f0=None):
21 |         time_step = self.window / self.sr * 1000
22 |         f0_min = 50
23 |         f0_max = 1100
24 |         f0_mel_min = 1127 * np.log(1 + f0_min / 700)
25 |         f0_mel_max = 1127 * np.log(1 + f0_max / 700)
26 |         if(f0_method=="pm"):
27 |             f0 = parselmouth.Sound(x, self.sr).to_pitch_ac(
28 |                 time_step=time_step / 1000, voicing_threshold=0.6,
29 |                 pitch_floor=f0_min, pitch_ceiling=f0_max).selected_array['frequency']
30 |             pad_size=(p_len - len(f0) + 1) // 2
31 |             if(pad_size>0 or p_len - len(f0) - pad_size>0):
32 |                 f0 = np.pad(f0,[[pad_size,p_len - len(f0) - pad_size]], mode='constant')
33 |         elif(f0_method=="harvest"):
34 |             f0, t = pyworld.harvest(
35 |                 x.astype(np.double),
36 |                 fs=self.sr,
37 |                 f0_ceil=f0_max,
38 |                 frame_period=10,
39 |             )
40 |             f0 = pyworld.stonemask(x.astype(np.double), f0, t, self.sr)
41 |             f0 = signal.medfilt(f0, 3)
42 |         f0 *= pow(2, f0_up_key / 12)
43 |         # with open("test.txt","w")as f:f.write("\n".join([str(i)for i in f0.tolist()]))
44 |         tf0=self.sr//self.window  # f0 points per second
45 |         if (inp_f0 is not None):
46 |             delta_t=np.round((inp_f0[:,0].max()-inp_f0[:,0].min())*tf0+1).astype("int16")
47 |             replace_f0=np.interp(list(range(delta_t)), inp_f0[:, 0]*100, inp_f0[:, 1])
48 |             shape=f0[x_pad*tf0:x_pad*tf0+len(replace_f0)].shape[0]
49 |             f0[x_pad*tf0:x_pad*tf0+len(replace_f0)]=replace_f0[:shape]
50 |         # with open("test_opt.txt","w")as f:f.write("\n".join([str(i)for i in f0.tolist()]))
51 |         f0bak = f0.copy()
52 |         f0_mel = 1127 * np.log(1 + f0 / 700)
53 |         f0_mel[f0_mel > 0] = (f0_mel[f0_mel > 0] - f0_mel_min) * 254 / (f0_mel_max - f0_mel_min) + 1
54 |         f0_mel[f0_mel <= 1] = 1
55 |         f0_mel[f0_mel > 255] = 255
56 |         f0_coarse = np.rint(f0_mel).astype(int)  # builtin int: np.int is removed in recent numpy
57 |         return f0_coarse, f0bak  # 1-0
58 |
59 |     def vc(self,model,net_g,sid,audio0,pitch,pitchf,times,index,big_npy,index_rate):  # ,file_index,file_big_npy
60 |         feats = torch.from_numpy(audio0)
61 |         if(self.is_half==True):feats=feats.half()
62 |         else:feats=feats.float()
63 |         if feats.dim() == 2:  # double channels
64 |             feats = feats.mean(-1)
65 |         assert feats.dim() == 1, feats.dim()
66 |         feats = feats.view(1, -1)
67 |         padding_mask = torch.BoolTensor(feats.shape).to(self.device).fill_(False)
68 |
69 |         inputs = {
70 |             "source": feats.to(self.device),
71 |             "padding_mask": padding_mask,
72 |             "output_layer": 9,  # layer 9
73 |         }
74 |         t0 = ttime()
75 |         with torch.no_grad():
76 |             logits = model.extract_features(**inputs)
77 |             feats = model.final_proj(logits[0])
78 |
79 |         if(isinstance(index,type(None))==False and isinstance(big_npy,type(None))==False and index_rate!=0):
80 |             npy = feats[0].cpu().numpy()
81 |             if(self.is_half==True):npy=npy.astype("float32")
82 |             _, I = index.search(npy, 1)
83 |             npy=big_npy[I.squeeze()]
84 |             if(self.is_half==True):npy=npy.astype("float16")
85 |             feats = torch.from_numpy(npy).unsqueeze(0).to(self.device)*index_rate + (1-index_rate)*feats
86 |
87 |         feats = F.interpolate(feats.permute(0, 2, 1), scale_factor=2).permute(0, 2, 1)
88 |         t1 = ttime()
89 |         p_len = audio0.shape[0]//self.window
90 |         if(feats.shape[1]self.t_max):
121 |             audio_sum = np.zeros_like(audio)
122 |             for i in range(self.window): audio_sum += audio_pad[i:i - self.window]
123 |             for t in range(self.t_center, audio.shape[0],self.t_center):opt_ts.append(t - self.t_query + np.where(np.abs(audio_sum[t - self.t_query:t + self.t_query]) == np.abs(audio_sum[t - self.t_query:t + self.t_query]).min())[0][0])
124 |         s = 0
125 |         audio_opt=[]
126 |         t=None
127 |         t1=ttime()
128 |         audio_pad = np.pad(audio, (self.t_pad, self.t_pad), mode='reflect')
129 |         p_len=audio_pad.shape[0]//self.window
130 |         inp_f0=None
131 |         if(hasattr(f0_file,'name') ==True):
132 |             try:
133 |                 with open(f0_file.name,"r")as f:
134 |                     lines=f.read().strip("\n").split("\n")
135 |                 inp_f0=[]
136 |                 for line in lines:inp_f0.append([float(i)for i in line.split(",")])
137 |                 inp_f0=np.array(inp_f0,dtype="float32")
138 |             except:
139 |                 traceback.print_exc()
140 |         sid=torch.tensor(sid,device=self.device).unsqueeze(0).long()
141 |         pitch, pitchf=None,None
142 |         if(if_f0==1):
143 |             pitch, pitchf = self.get_f0(audio_pad, p_len, f0_up_key,f0_method,inp_f0)
144 |             pitch = pitch[:p_len]
145 |             pitchf = pitchf[:p_len]
146 |             pitch = torch.tensor(pitch,device=self.device).unsqueeze(0).long()
147 |             pitchf = torch.tensor(pitchf,device=self.device).unsqueeze(0).float()
148 |         t2=ttime()
149 |         times[1] += (t2 - t1)
150 |         for t in opt_ts:
151 |             t=t//self.window*self.window
152 |             if (if_f0 == 1):
153 |                 audio_opt.append(self.vc(model,net_g,sid,audio_pad[s:t+self.t_pad2+self.window],pitch[:,s//self.window:(t+self.t_pad2)//self.window],pitchf[:,s//self.window:(t+self.t_pad2)//self.window],times,index,big_npy,index_rate)[self.t_pad_tgt:-self.t_pad_tgt])
154 |             else:
155 |                 audio_opt.append(self.vc(model,net_g,sid,audio_pad[s:t+self.t_pad2+self.window],None,None,times,index,big_npy,index_rate)[self.t_pad_tgt:-self.t_pad_tgt])
156 |             s = t
157 |         if (if_f0 == 1):
158 |             audio_opt.append(self.vc(model,net_g,sid,audio_pad[t:],pitch[:,t//self.window:]if t is not None else pitch,pitchf[:,t//self.window:]if t is not None else pitchf,times,index,big_npy,index_rate)[self.t_pad_tgt:-self.t_pad_tgt])
159 |         else:
160 |             audio_opt.append(self.vc(model,net_g,sid,audio_pad[t:],None,None,times,index,big_npy,index_rate)[self.t_pad_tgt:-self.t_pad_tgt])
161 |         audio_opt=np.concatenate(audio_opt)
162 |         del pitch,pitchf,sid
163 |         if torch.cuda.is_available(): torch.cuda.empty_cache()
164 |         return audio_opt
165 |
--------------------------------------------------------------------------------
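vc_infer_pipeline.py is the conversion core driven by infer-web.py: a VC object is created for the target sample rate, HuBERT features are extracted from hubert_base.pt, optionally blended with a faiss retrieval index, and net_g synthesizes each slice. The f0 side can be exercised on its own without those checkpoints; a minimal sketch follows, in which the 40000 Hz target rate, the synthetic 220 Hz tone and the +2 semitone shift are arbitrary assumptions (config.py must be importable, since the module reads x_pad/x_query/x_center/x_max from it at import time).

import numpy as np
import torch

from vc_infer_pipeline import VC

device = "cuda" if torch.cuda.is_available() else "cpu"
vc = VC(tgt_sr=40000, device=device, is_half=False)

# One second of a synthetic 220 Hz tone at the 16 kHz rate the pipeline works in.
sr = 16000
t = np.arange(sr) / sr
audio = 0.5 * np.sin(2 * np.pi * 220.0 * t)

p_len = audio.shape[0] // vc.window  # one f0 frame per 160 samples
f0_coarse, f0 = vc.get_f0(audio, p_len, f0_up_key=2, f0_method="pm")
print(f0_coarse.shape, float(f0[f0 > 0].mean()))  # coarse 1-255 bins; roughly 247 Hz after the +2 semitone shift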
/weights/.gitignore:
--------------------------------------------------------------------------------
1 | *
2 | !.gitignore
3 |
--------------------------------------------------------------------------------
/使用需遵守的协议-LICENSE.txt:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2023 liujing04
4 | Copyright (c) 2023 源文雨
5 |
6 | This software and its related code are open-sourced under the MIT License. The author has no control over the software; those who use the software or distribute audio exported from it are solely responsible for doing so.
7 | If you do not accept these terms, you may not use or reference any code or files contained in this software package.
8 |
9 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
10 |
11 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
12 |
13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
14 |
15 | Any person obtaining a copy of this software and its associated documentation files (the "Software") is hereby granted, free of charge, the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
16 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
17 | The Software is provided "as is", without warranty of any kind, express or implied, including but not limited to warranties of merchantability, fitness for a particular purpose, and non-infringement. In no event shall the authors or copyright holders be liable for any claim, damages, or other liability, whether in an action of contract, tort, or otherwise, arising from, out of, or in connection with the Software or the use of or other dealings in the Software.
18 |
19 | The licenses of the referenced libraries are as follows:
20 | #################
21 | ContentVec
22 | https://github.com/auspicious3000/contentvec/blob/main/LICENSE
23 | MIT License
24 | #################
25 | VITS
26 | https://github.com/jaywalnut310/vits/blob/main/LICENSE
27 | MIT License
28 | #################
29 | HIFIGAN
30 | https://github.com/jik876/hifi-gan/blob/master/LICENSE
31 | MIT License
32 | #################
33 | gradio
34 | https://github.com/gradio-app/gradio/blob/main/LICENSE
35 | Apache License 2.0
36 | #################
37 | ffmpeg
38 | https://github.com/FFmpeg/FFmpeg/blob/master/COPYING.LGPLv3
39 | https://github.com/BtbN/FFmpeg-Builds/releases/download/autobuild-2021-02-28-12-32/ffmpeg-n4.3.2-160-gfbb9368226-win64-lgpl-4.3.zip
40 | LGPLv3 License
41 | MIT License
42 | #################
43 | ultimatevocalremovergui
44 | https://github.com/Anjok07/ultimatevocalremovergui/blob/master/LICENSE
45 | https://github.com/yang123qwe/vocal_separation_by_uvr5
46 | MIT License
47 | #################
48 | audio-slicer
49 | https://github.com/openvpi/audio-slicer/blob/main/LICENSE
50 | MIT License
51 |
--------------------------------------------------------------------------------
/小白简易教程.doc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yantaisa11/Retrieval-based-Voice-Conversion-WebUI-JP-localization/106f3ac45e1a1206489b32500f1fd66ad1744339/小白简易教程.doc
--------------------------------------------------------------------------------