├── .gitignore ├── Changelog_CN.md ├── LICENSE ├── README.md ├── README_en.md ├── RVC改进意见.txt ├── Retrieval_based_Voice_Conversion_WebUI.ipynb ├── config.py ├── configs ├── 32k.json ├── 40k.json └── 48k.json ├── envfilescheck.bat ├── export_onnx.py ├── extract_f0_print.py ├── extract_feature_print.py ├── go-web_jp.bat ├── gui.py ├── infer-web.py ├── infer-web_jp.py ├── infer ├── infer-pm-index256.py ├── train-index.py └── trans_weights.py ├── infer_pack ├── attentions.py ├── commons.py ├── models.py ├── models_onnx.py ├── modules.py └── transforms.py ├── infer_uvr5.py ├── logs └── mute │ ├── 0_gt_wavs │ ├── mute32k.wav │ ├── mute40k.wav │ └── mute48k.wav │ ├── 1_16k_wavs │ └── mute.wav │ ├── 2a_f0 │ └── mute.wav.npy │ ├── 2b-f0nsf │ └── mute.wav.npy │ └── 3_feature256 │ └── mute.npy ├── my_utils.py ├── poetry.lock ├── pretrained └── .gitignore ├── pyproject.toml ├── requirements.txt ├── slicer2.py ├── train ├── cmd.txt ├── data_utils.py ├── losses.py ├── mel_processing.py ├── process_ckpt.py └── utils.py ├── train_nsf_sim_cache_sid_load_pretrain.py ├── trainset_preprocess_pipeline_print.py ├── uvr5_pack ├── lib_v5 │ ├── dataset.py │ ├── layers.py │ ├── layers_123812KB .py │ ├── layers_123821KB.py │ ├── layers_33966KB.py │ ├── layers_537227KB.py │ ├── layers_537238KB.py │ ├── model_param_init.py │ ├── modelparams │ │ ├── 1band_sr16000_hl512.json │ │ ├── 1band_sr32000_hl512.json │ │ ├── 1band_sr33075_hl384.json │ │ ├── 1band_sr44100_hl1024.json │ │ ├── 1band_sr44100_hl256.json │ │ ├── 1band_sr44100_hl512.json │ │ ├── 1band_sr44100_hl512_cut.json │ │ ├── 2band_32000.json │ │ ├── 2band_44100_lofi.json │ │ ├── 2band_48000.json │ │ ├── 3band_44100.json │ │ ├── 3band_44100_mid.json │ │ ├── 3band_44100_msb2.json │ │ ├── 4band_44100.json │ │ ├── 4band_44100_mid.json │ │ ├── 4band_44100_msb.json │ │ ├── 4band_44100_msb2.json │ │ ├── 4band_44100_reverse.json │ │ ├── 4band_44100_sw.json │ │ ├── 4band_v2.json │ │ ├── 4band_v2_sn.json │ │ └── ensemble.json │ ├── nets.py │ ├── nets_123812KB.py │ ├── nets_123821KB.py │ ├── nets_33966KB.py │ ├── nets_537227KB.py │ ├── nets_537238KB.py │ ├── nets_61968KB.py │ └── spec_utils.py └── utils.py ├── uvr5_weights └── .gitignore ├── vc_infer_pipeline.py ├── weights └── .gitignore ├── 使用需遵守的协议-LICENSE.txt └── 小白简易教程.doc /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | __pycache__ 3 | /TEMP 4 | *.pyd 5 | hubert_base.pt 6 | /logs 7 | -------------------------------------------------------------------------------- /Changelog_CN.md: -------------------------------------------------------------------------------- 1 | 20230409 2 | 3 | 修正训练参数,提升显卡平均利用率,A100最高从25%提升至90%左右,V100:50%->90%左右,2060S:60%->85%左右,P40:25%->95%左右,训练速度显著提升 4 | 5 | 修正参数:总batch_size改为每张卡的batch_size 6 | 7 | 修正total_epoch:最大限制100解锁至1000;默认10提升至默认20 8 | 9 | 修复ckpt提取识别是否带音高错误导致推理异常的问题 10 | 11 | 修复分布式训练每个rank都保存一次ckpt的问题 12 | 13 | 特征提取进行nan特征过滤 14 | 15 | 修复静音输入输出随机辅音or噪声的问题(老版模型需要重做训练集重训) 16 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 liujing04 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of 
the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |
2 | 
3 | # Retrieval-based-Voice-Conversion-WebUI
4 | 一个基于VITS的简单易用的语音转换(变声器)框架
5 | 6 | [![madewithlove](https://forthebadge.com/images/badges/built-with-love.svg)](https://github.com/liujing04/Retrieval-based-Voice-Conversion-WebUI) 7 | 8 |
9 | 10 | [![Open In Colab](https://img.shields.io/badge/Colab-F9AB00?style=for-the-badge&logo=googlecolab&color=525252)](https://colab.research.google.com/github/liujing04/Retrieval-based-Voice-Conversion-WebUI/blob/main/Retrieval_based_Voice_Conversion_WebUI.ipynb) 11 | [![Licence](https://img.shields.io/github/license/liujing04/Retrieval-based-Voice-Conversion-WebUI?style=for-the-badge)](https://github.com/liujing04/Retrieval-based-Voice-Conversion-WebUI/blob/main/%E4%BD%BF%E7%94%A8%E9%9C%80%E9%81%B5%E5%AE%88%E7%9A%84%E5%8D%8F%E8%AE%AE-LICENSE.txt) 12 | [![Huggingface](https://img.shields.io/badge/🤗%20-Spaces-blue.svg?style=for-the-badge)](https://huggingface.co/lj1995/VoiceConversionWebUI/tree/main/) 13 | 14 |
15 | 16 | ------ 17 | 18 | [**更新日志**](https://github.com/liujing04/Retrieval-based-Voice-Conversion-WebUI/blob/main/Changelog_CN.md) 19 | 20 | [**English**](./README_en.md) | [**中文简体**](./README.md) 21 | 22 | > 点此查看我们的[演示视频](https://www.bilibili.com/video/BV1pm4y1z7Gm/) ! 23 | 24 | > 使用了RVC的实时语音转换: [w-okada/voice-changer](https://github.com/w-okada/voice-changer) 25 | 26 | ## 简介 27 | 本仓库具有以下特点: 28 | + 使用top1特征模型检索来杜绝音色泄漏; 29 | + 即便在相对较差的显卡上也能快速训练; 30 | + 使用少量数据进行训练也能得到较好结果; 31 | + 可以通过模型融合来改变音色; 32 | + 简单易用的WebUI界面; 33 | + 可调用UVR5模型来快速分离人声和伴奏。 34 | + 底模训练集使用接近50小时的高质量VCTK开源,后续会陆续加入高质量有授权歌声训练集供大家放心使用。 35 | ## 环境配置 36 | 我们推荐你使用poetry来配置环境。 37 | 38 | 以下指令需在Python版本大于3.8的环境当中执行: 39 | ```bash 40 | # 安装Pytorch及其核心依赖,若已安装则跳过 41 | # 参考自: https://pytorch.org/get-started/locally/ 42 | pip install torch torchvision torchaudio 43 | 44 | 如果是win系统+30系显卡,根据https://github.com/liujing04/Retrieval-based-Voice-Conversion-WebUI/issues/21的经验,需要指定pytorch对应的cuda版本 45 | 46 | pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117 47 | 48 | # 安装 Poetry 依赖管理工具, 若已安装则跳过 49 | # 参考自: https://python-poetry.org/docs/#installation 50 | curl -sSL https://install.python-poetry.org | python3 - 51 | 52 | # 通过poetry安装依赖 53 | poetry install 54 | ``` 55 | 56 | 你也可以通过pip来安装依赖: 57 | 58 | **注意**: `MacOS`下`faiss 1.7.2`版本会导致抛出段错误,请将`requirements.txt`的对应条目改为`faiss-cpu==1.7.0` 59 | 60 | ```bash 61 | pip install -r requirements.txt 62 | ``` 63 | 64 | ## 其他预模型准备 65 | RVC需要其他的一些预模型来推理和训练。 66 | 67 | 你可以从我们的[Huggingface space](https://huggingface.co/lj1995/VoiceConversionWebUI/tree/main/)下载到这些模型。 68 | 69 | 以下是一份清单,包括了所有RVC所需的预模型和其他文件的名称: 70 | ```bash 71 | hubert_base.pt 72 | 73 | ./pretrained 74 | 75 | ./uvr5_weights 76 | 77 | #如果你正在使用Windows,则你可能需要这个文件夹,若FFmpeg已安装则跳过 78 | ./ffmpeg 79 | ``` 80 | 之后使用以下指令来调用Webui: 81 | ```bash 82 | python infer-web.py 83 | ``` 84 | 如果你正在使用Windows,你可以直接下载并解压`RVC-beta.7z` 来使用RVC,运行`go-web.bat`来启动WebUI。 85 | 86 | 我们将在两周内推出一个英文版本的WebUI. 
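Before launching the WebUI it can be useful to confirm that the pre-models listed above are actually in place. The snippet below is a minimal, hypothetical helper (it is not part of this repository) that only checks for the paths named in the checklist — `hubert_base.pt`, `./pretrained` and `./uvr5_weights`:

```python
# Hypothetical pre-flight check for the files listed in the checklist above.
import os
import sys

required = [
    "hubert_base.pt",   # HuBERT feature extractor weights
    "pretrained",       # base generator/discriminator models (D*/G*/f0D*/f0G*.pth)
    "uvr5_weights",     # UVR5 vocal/instrumental separation models
]

missing = [p for p in required if not os.path.exists(p)]
if missing:
    print("missing pre-model paths:", ", ".join(missing))
    sys.exit(1)
print("all pre-model paths found, you can start the WebUI with: python infer-web.py")
```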
87 | 88 | 仓库内还有一份`小白简易教程.doc`以供参考。 89 | 90 | ## 参考项目 91 | + [ContentVec](https://github.com/auspicious3000/contentvec/) 92 | + [VITS](https://github.com/jaywalnut310/vits) 93 | + [HIFIGAN](https://github.com/jik876/hifi-gan) 94 | + [Gradio](https://github.com/gradio-app/gradio) 95 | + [FFmpeg](https://github.com/FFmpeg/FFmpeg) 96 | + [Ultimate Vocal Remover](https://github.com/Anjok07/ultimatevocalremovergui) 97 | + [audio-slicer](https://github.com/openvpi/audio-slicer) 98 | ## 感谢所有贡献者作出的努力 99 | 100 | 101 | 102 | 103 | -------------------------------------------------------------------------------- /README_en.md: -------------------------------------------------------------------------------- 1 | # Retrieval-based-Voice-Conversion-WebUI 2 | 3 | [![madewithlove](https://forthebadge.com/images/badges/built-with-love.svg)](https://github.com/liujing04/Retrieval-based-Voice-Conversion-WebUI) 4 | 5 | [![Open In Colab](https://img.shields.io/badge/Colab-F9AB00?style=for-the-badge&logo=googlecolab&color=525252)](https://colab.research.google.com/github/liujing04/Retrieval-based-Voice-Conversion-WebUI/blob/main/Retrieval_based_Voice_Conversion_WebUI.ipynb) 6 | [![Licence](https://img.shields.io/github/license/liujing04/Retrieval-based-Voice-Conversion-WebUI?style=for-the-badge)](https://github.com/liujing04/Retrieval-based-Voice-Conversion-WebUI/blob/main/%E4%BD%BF%E7%94%A8%E9%9C%80%E9%81%B5%E5%AE%88%E7%9A%84%E5%8D%8F%E8%AE%AE-LICENSE.txt) 7 | [![Huggingface](https://img.shields.io/badge/🤗%20-Spaces-blue.svg?style=for-the-badge)](https://huggingface.co/lj1995/VoiceConversionWebUI/tree/main/) 8 | 9 | ### Realtime Voice Conversion Software using RVC : [w-okada/voice-changer](https://github.com/w-okada/voice-changer) 10 | ------ 11 | 12 | An easy-to-use SVC framework based on VITS. 13 | 14 | [**English**](./README.md) | [**中文简体**](./README_zh_CN.md) 15 | 16 | > Check our [Demo Video](https://www.bilibili.com/video/BV1pm4y1z7Gm/) here! 17 | ## Summary 18 | This repository has the following features: 19 | + Using top1 feature model retrieval to reduce tone leakage; 20 | + Easy and fast training, even on relatively poor graphics cards; 21 | + Training with a small amount of data also obtains relatively good results; 22 | + Supporting model fusion to change timbres; 23 | + Easy-to-use Webui interface; 24 | + Use the UVR5 model to quickly separate vocals and instruments. 25 | + The dataset for the pre-training model uses nearly 50 hours of high quality VCTK open source, and high quality licensed song datasets will be added one after another for your use, without worrying about copyright infringement. 26 | ## Preparing the environment 27 | We recommend you install the dependencies through poetry. 
28 | 29 | The following commands need to be executed in the environment of Python version 3.8 or higher: 30 | ```bash 31 | # Install PyTorch-related core dependencies, skip if installed 32 | # Reference: https://pytorch.org/get-started/locally/ 33 | pip install torch torchvision torchaudio 34 | 35 | #For Windows + 30-series Nvidia cards, you need to specify the cuda version corresponding to pytorch according to the experience of https://github.com/liujing04/Retrieval-based-Voice-Conversion-WebUI/issues/21 36 | 37 | pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117 38 | 39 | # Install the Poetry dependency management tool, skip if installed 40 | # Reference: https://python-poetry.org/docs/#installation 41 | curl -sSL https://install.python-poetry.org | python3 - 42 | 43 | # Install the project dependencies 44 | poetry install 45 | ``` 46 | You can also use pip to install the dependencies 47 | 48 | **Notice**: `faiss 1.7.2` will raise Segmentation Fault: 11 under `MacOS`, please change corresponding line in `requirements.txt` to `faiss-cpu==1.7.0` 49 | 50 | ```bash 51 | pip install -r requirements.txt 52 | ``` 53 | 54 | ## Preparation of other Pre-models 55 | RVC requires other pre-models to infer and train. 56 | 57 | You need to download them from our [Huggingface space](https://huggingface.co/lj1995/VoiceConversionWebUI/tree/main/). 58 | 59 | Here's a list of Pre-models and other files that RVC needs: 60 | ```bash 61 | hubert_base.pt 62 | 63 | ./pretrained 64 | 65 | ./uvr5_weights 66 | 67 | #If you are using Windows, you may also need this dictionary, skip if FFmpeg is installed 68 | ffmpeg.exe 69 | ``` 70 | Then use this command to start Webui: 71 | ```bash 72 | python infer-web.py 73 | ``` 74 | If you are using Windows, you can download and extract `RVC-beta.7z` to use RVC directly and use `go-web.bat` to start Webui. 75 | 76 | We will develop an English version of the WebUI in 2 weeks. 77 | 78 | There's also a tutorial on RVC in Chinese and you can check it out if needed. 79 | 80 | ## Credits 81 | 82 | ## Thanks to all contributors for their efforts 83 | 84 | 85 | 86 | 87 | 88 | -------------------------------------------------------------------------------- /RVC改进意见.txt: -------------------------------------------------------------------------------- 1 | ToDo: 2 | 3 | 停车按钮 4 | 5 | 根据每E时间推测训练剩余时间 6 | 7 | 记录点Demo: 8 | 推理时可以选择哪些记录点,然后批量自动推理出demo以便对比节点过拟合和欠拟合情况 9 | 训练时可以自动推理每个保存节点的Demo便于实时听过拟合和欠拟合[可单独选择一张推理用卡] 10 | 11 | 训练队列: 12 | 可以队列训练列表,训练结束后自动进行下一个训练 13 | 14 | 配置文件保存: 15 | WebUI的预设可以保存为配置文件,下次启动时自动读取 16 | 17 | 推理自动选择特征库检索文件 18 | 19 | Epoch和保存频率、Batch size等可以从滑条改为一个纵向的输入数字的配置面板 20 | 21 | WebUI可以重新布局? 详情参考目录下的WebUI_参考(目前尚未建立) 22 | 23 | 模型推理可以做成单次拖拽类的 24 | 25 | 26 | 个人的小想法: 27 | 可以试着接入一些类似于Vocaloid的工程文件来读取F0音高曲线? 28 | 比如SV,ACE,Vocaloid,Cevio Studio这种歌声合成软件 29 | 然后再给到f0编辑器(如果有了) 30 | 31 | 能暴露接口然后可以用QT做个桌面程序?毕竟QT也是跨平台的 32 | 可以给到一个端口,让他们在云端跑,本地跑这个QT程序桌面程序来控制云端的训练和推理? 
33 | 34 | 35 | 36 | IsDo: -------------------------------------------------------------------------------- /config.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | parser = argparse.ArgumentParser() 3 | parser.add_argument("--port", type=int, default=7865, help="Listen port") 4 | parser.add_argument("--pycmd", type=str, default="python", help="Python command") 5 | parser.add_argument("--colab", action='store_true', help="Launch in colab") 6 | parser.add_argument("--noparallel", action='store_true', help="Disable parallel processing") 7 | cmd_opts = parser.parse_args() 8 | ############离线VC参数 9 | inp_root=r"白鹭霜华长条"#对输入目录下所有音频进行转换,别放非音频文件 10 | opt_root=r"opt"#输出目录 11 | f0_up_key=0#升降调,整数,男转女12,女转男-12 12 | person=r"weights\洛天依v3.pt"#目前只有洛天依v3 13 | ############硬件参数 14 | device = "cuda:0"#填写cuda:x或cpu,x指代第几张卡,只支持N卡加速 15 | is_half=True#9-10-20-30-40系显卡无脑True,不影响质量,>=20显卡开启有加速 16 | n_cpu=0#默认0用上所有线程,写数字限制CPU资源使用 17 | ############python命令路径 18 | python_cmd=cmd_opts.pycmd 19 | listen_port=cmd_opts.port 20 | iscolab=cmd_opts.colab 21 | noparallel=cmd_opts.noparallel 22 | ############下头别动 23 | import torch 24 | if(torch.cuda.is_available()==False): 25 | print("没有发现支持的N卡, 使用CPU进行推理") 26 | device="cpu" 27 | is_half=False 28 | if(device!="cpu"): 29 | gpu_name=torch.cuda.get_device_name(int(device.split(":")[-1])) 30 | if("16"in gpu_name or "MX"in gpu_name): 31 | print("16系显卡/MX系显卡强制单精度") 32 | is_half=False 33 | from multiprocessing import cpu_count 34 | if(n_cpu==0):n_cpu=cpu_count() 35 | if(is_half==True): 36 | #6G显存配置 37 | x_pad = 3 38 | x_query = 10 39 | x_center = 60 40 | x_max = 65 41 | else: 42 | #5G显存配置 43 | x_pad = 1 44 | # x_query = 6 45 | # x_center = 30 46 | # x_max = 32 47 | #6G显存配置 48 | x_query = 6 49 | x_center = 38 50 | x_max = 41 51 | -------------------------------------------------------------------------------- /configs/32k.json: -------------------------------------------------------------------------------- 1 | { 2 | "train": { 3 | "log_interval": 200, 4 | "seed": 1234, 5 | "epochs": 20000, 6 | "learning_rate": 1e-4, 7 | "betas": [0.8, 0.99], 8 | "eps": 1e-9, 9 | "batch_size": 4, 10 | "fp16_run": true, 11 | "lr_decay": 0.999875, 12 | "segment_size": 12800, 13 | "init_lr_ratio": 1, 14 | "warmup_epochs": 0, 15 | "c_mel": 45, 16 | "c_kl": 1.0 17 | }, 18 | "data": { 19 | "max_wav_value": 32768.0, 20 | "sampling_rate": 32000, 21 | "filter_length": 1024, 22 | "hop_length": 320, 23 | "win_length": 1024, 24 | "n_mel_channels": 80, 25 | "mel_fmin": 0.0, 26 | "mel_fmax": null 27 | }, 28 | "model": { 29 | "inter_channels": 192, 30 | "hidden_channels": 192, 31 | "filter_channels": 768, 32 | "n_heads": 2, 33 | "n_layers": 6, 34 | "kernel_size": 3, 35 | "p_dropout": 0, 36 | "resblock": "1", 37 | "resblock_kernel_sizes": [3,7,11], 38 | "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]], 39 | "upsample_rates": [10,4,2,2,2], 40 | "upsample_initial_channel": 512, 41 | "upsample_kernel_sizes": [16,16,4,4,4], 42 | "use_spectral_norm": false, 43 | "gin_channels": 256, 44 | "spk_embed_dim": 109 45 | } 46 | } 47 | -------------------------------------------------------------------------------- /configs/40k.json: -------------------------------------------------------------------------------- 1 | { 2 | "train": { 3 | "log_interval": 200, 4 | "seed": 1234, 5 | "epochs": 20000, 6 | "learning_rate": 1e-4, 7 | "betas": [0.8, 0.99], 8 | "eps": 1e-9, 9 | "batch_size": 4, 10 | "fp16_run": true, 11 | "lr_decay": 0.999875, 12 | "segment_size": 12800, 13 
| "init_lr_ratio": 1, 14 | "warmup_epochs": 0, 15 | "c_mel": 45, 16 | "c_kl": 1.0 17 | }, 18 | "data": { 19 | "max_wav_value": 32768.0, 20 | "sampling_rate": 40000, 21 | "filter_length": 2048, 22 | "hop_length": 400, 23 | "win_length": 2048, 24 | "n_mel_channels": 125, 25 | "mel_fmin": 0.0, 26 | "mel_fmax": null 27 | }, 28 | "model": { 29 | "inter_channels": 192, 30 | "hidden_channels": 192, 31 | "filter_channels": 768, 32 | "n_heads": 2, 33 | "n_layers": 6, 34 | "kernel_size": 3, 35 | "p_dropout": 0, 36 | "resblock": "1", 37 | "resblock_kernel_sizes": [3,7,11], 38 | "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]], 39 | "upsample_rates": [10,10,2,2], 40 | "upsample_initial_channel": 512, 41 | "upsample_kernel_sizes": [16,16,4,4], 42 | "use_spectral_norm": false, 43 | "gin_channels": 256, 44 | "spk_embed_dim": 109 45 | } 46 | } 47 | -------------------------------------------------------------------------------- /configs/48k.json: -------------------------------------------------------------------------------- 1 | { 2 | "train": { 3 | "log_interval": 200, 4 | "seed": 1234, 5 | "epochs": 20000, 6 | "learning_rate": 1e-4, 7 | "betas": [0.8, 0.99], 8 | "eps": 1e-9, 9 | "batch_size": 4, 10 | "fp16_run": true, 11 | "lr_decay": 0.999875, 12 | "segment_size": 11520, 13 | "init_lr_ratio": 1, 14 | "warmup_epochs": 0, 15 | "c_mel": 45, 16 | "c_kl": 1.0 17 | }, 18 | "data": { 19 | "max_wav_value": 32768.0, 20 | "sampling_rate": 48000, 21 | "filter_length": 2048, 22 | "hop_length": 480, 23 | "win_length": 2048, 24 | "n_mel_channels": 128, 25 | "mel_fmin": 0.0, 26 | "mel_fmax": null 27 | }, 28 | "model": { 29 | "inter_channels": 192, 30 | "hidden_channels": 192, 31 | "filter_channels": 768, 32 | "n_heads": 2, 33 | "n_layers": 6, 34 | "kernel_size": 3, 35 | "p_dropout": 0, 36 | "resblock": "1", 37 | "resblock_kernel_sizes": [3,7,11], 38 | "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]], 39 | "upsample_rates": [10,6,2,2,2], 40 | "upsample_initial_channel": 512, 41 | "upsample_kernel_sizes": [16,16,4,4,4], 42 | "use_spectral_norm": false, 43 | "gin_channels": 256, 44 | "spk_embed_dim": 109 45 | } 46 | } 47 | -------------------------------------------------------------------------------- /envfilescheck.bat: -------------------------------------------------------------------------------- 1 | @echo off && chcp 65001 2 | 3 | echo working dir is %cd% 4 | echo downloading requirement aria2 check. 5 | echo= 6 | dir /a:d/b | findstr "aria2" > flag.txt 7 | findstr "aria2" flag.txt >nul 8 | if %errorlevel% ==0 ( 9 | echo aria2 checked. 10 | echo= 11 | ) else ( 12 | echo failed. please downloading aria2 from webpage! 13 | echo unzip it and put in this directory! 14 | timeout /T 5 15 | start https://github.com/aria2/aria2/releases/tag/release-1.36.0 16 | echo= 17 | goto end 18 | ) 19 | 20 | echo envfiles checking start. 
21 | echo= 22 | 23 | for /f %%x in ('findstr /i /c:"aria2" "flag.txt"') do (set aria2=%%x)&goto endSch 24 | :endSch 25 | 26 | set d32=f0D32k.pth 27 | set d40=f0D40k.pth 28 | set d48=f0D48k.pth 29 | set g32=f0G32k.pth 30 | set g40=f0G40k.pth 31 | set g48=f0G48k.pth 32 | 33 | set dld32=https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/f0D32k.pth 34 | set dld40=https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/f0D40k.pth 35 | set dld48=https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/f0D48k.pth 36 | set dlg32=https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/f0G32k.pth 37 | set dlg40=https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/f0G40k.pth 38 | set dlg48=https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/f0G48k.pth 39 | 40 | set hp2=HP2-人声vocals+非人声instrumentals.pth 41 | set hp5=HP5-主旋律人声vocals+其他instrumentals.pth 42 | 43 | set dlhp2=https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/uvr5_weights/HP2-人声vocals+非人声instrumentals.pth 44 | set dlhp5=https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/uvr5_weights/HP5-主旋律人声vocals+其他instrumentals.pth 45 | 46 | set hb=hubert_base.pt 47 | 48 | set dlhb=https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/hubert_base.pt 49 | 50 | echo dir check start. 51 | echo= 52 | 53 | if exist "%~dp0pretrained" ( 54 | echo dir .\pretrained checked. 55 | ) else ( 56 | echo failed. generating dir .\pretrained. 57 | mkdir pretrained 58 | ) 59 | if exist "%~dp0uvr5_weights" ( 60 | echo dir .\uvr5_weights checked. 61 | ) else ( 62 | echo failed. generating dir .\uvr5_weights. 63 | mkdir uvr5_weights 64 | ) 65 | 66 | echo= 67 | echo dir check finished. 68 | 69 | echo= 70 | echo required files check start. 71 | 72 | echo checking D32k.pth 73 | if exist "%~dp0pretrained\D32k.pth" ( 74 | echo D32k.pth in .\pretrained checked. 75 | echo= 76 | ) else ( 77 | echo failed. starting download from huggingface. 78 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/D32k.pth -d %~dp0pretrained -o D32k.pth 79 | if exist "%~dp0pretrained\D32k.pth" (echo download successful.) else (echo please try again! 80 | echo=) 81 | ) 82 | echo checking D40k.pth 83 | if exist "%~dp0pretrained\D40k.pth" ( 84 | echo D40k.pth in .\pretrained checked. 85 | echo= 86 | ) else ( 87 | echo failed. starting download from huggingface. 88 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/D40k.pth -d %~dp0pretrained -o D40k.pth 89 | if exist "%~dp0pretrained\D40k.pth" (echo download successful.) else (echo please try again! 90 | echo=) 91 | ) 92 | echo checking D48k.pth 93 | if exist "%~dp0pretrained\D48k.pth" ( 94 | echo D48k.pth in .\pretrained checked. 95 | echo= 96 | ) else ( 97 | echo failed. starting download from huggingface. 98 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/D48k.pth -d %~dp0pretrained -o D48k.pth 99 | if exist "%~dp0pretrained\D48k.pth" (echo download successful.) else (echo please try again! 100 | echo=) 101 | ) 102 | echo checking G32k.pth 103 | if exist "%~dp0pretrained\G32k.pth" ( 104 | echo G32k.pth in .\pretrained checked. 105 | echo= 106 | ) else ( 107 | echo failed. starting download from huggingface. 
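rem aria2c invocation used throughout this script: -c resumes an interrupted download,
rem -x 16 / -s 16 allow up to 16 connections and splits per file, -k 1M sets the minimum
rem split size, -d selects the output directory and -o the output filename.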
108 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/G32k.pth -d %~dp0pretrained -o G32k.pth 109 | if exist "%~dp0pretrained\G32k.pth" (echo download successful.) else (echo please try again! 110 | echo=) 111 | ) 112 | echo checking G40k.pth 113 | if exist "%~dp0pretrained\G40k.pth" ( 114 | echo G40k.pth in .\pretrained checked. 115 | echo= 116 | ) else ( 117 | echo failed. starting download from huggingface. 118 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/G40k.pth -d %~dp0pretrained -o G40k.pth 119 | if exist "%~dp0pretrained\G40k.pth" (echo download successful.) else (echo please try again! 120 | echo=) 121 | ) 122 | echo checking G48k.pth 123 | if exist "%~dp0pretrained\G48k.pth" ( 124 | echo G48k.pth in .\pretrained checked. 125 | echo= 126 | ) else ( 127 | echo failed. starting download from huggingface. 128 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained/G48k.pth -d %~dp0pretrained -o G48k.pth 129 | if exist "%~dp0pretrained\G48k.pth" (echo download successful.) else (echo please try again! 130 | echo=) 131 | ) 132 | 133 | echo checking %d32% 134 | if exist "%~dp0pretrained\%d32%" ( 135 | echo %d32% in .\pretrained checked. 136 | echo= 137 | ) else ( 138 | echo failed. starting download from huggingface. 139 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M %dld32% -d %~dp0pretrained -o %d32% 140 | if exist "%~dp0pretrained\%d32%" (echo download successful.) else (echo please try again! 141 | echo=) 142 | ) 143 | echo checking %d40% 144 | if exist "%~dp0pretrained\%d40%" ( 145 | echo %d40% in .\pretrained checked. 146 | echo= 147 | ) else ( 148 | echo failed. starting download from huggingface. 149 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M %dld40% -d %~dp0pretrained -o %d40% 150 | if exist "%~dp0pretrained\%d40%" (echo download successful.) else (echo please try again! 151 | echo=) 152 | ) 153 | echo checking %d48% 154 | if exist "%~dp0pretrained\%d48%" ( 155 | echo %d48% in .\pretrained checked. 156 | echo= 157 | ) else ( 158 | echo failed. starting download from huggingface. 159 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M %dld48% -d %~dp0pretrained -o %d48% 160 | if exist "%~dp0pretrained\%d48%" (echo download successful.) else (echo please try again! 161 | echo=) 162 | ) 163 | echo checking %g32% 164 | if exist "%~dp0pretrained\%g32%" ( 165 | echo %g32% in .\pretrained checked. 166 | echo= 167 | ) else ( 168 | echo failed. starting download from huggingface. 169 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M %dlg32% -d %~dp0pretrained -o %g32% 170 | if exist "%~dp0pretrained\%g32%" (echo download successful.) else (echo please try again! 171 | echo=) 172 | ) 173 | echo checking %g40% 174 | if exist "%~dp0pretrained\%g40%" ( 175 | echo %g40% in .\pretrained checked. 176 | echo= 177 | ) else ( 178 | echo failed. starting download from huggingface. 179 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M %dlg40% -d %~dp0pretrained -o %g40% 180 | if exist "%~dp0pretrained\%g40%" (echo download successful.) else (echo please try again! 181 | echo=) 182 | ) 183 | echo checking %g48% 184 | if exist "%~dp0pretrained\%g48%" ( 185 | echo %g48% in .\pretrained checked. 
186 | echo= 187 | ) else ( 188 | echo failed. starting download from huggingface. 189 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M %dlg48% -d %~dp0\pretrained -o %g48% 190 | if exist "%~dp0pretrained\%g48%" (echo download successful.) else (echo please try again! 191 | echo=) 192 | ) 193 | 194 | echo checking %hp2% 195 | if exist "%~dp0uvr5_weights\%hp2%" ( 196 | echo %hp2% in .\uvr5_weights checked. 197 | echo= 198 | ) else ( 199 | echo failed. starting download from huggingface. 200 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M %dlhp2% -d %~dp0\uvr5_weights -o %hp2% 201 | if exist "%~dp0uvr5_weights\%hp2%" (echo download successful.) else (echo please try again! 202 | echo=) 203 | ) 204 | echo checking %hp5% 205 | if exist "%~dp0uvr5_weights\%hp5%" ( 206 | echo %hp5% in .\uvr5_weights checked. 207 | echo= 208 | ) else ( 209 | echo failed. starting download from huggingface. 210 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M %dlhp5% -d %~dp0\uvr5_weights -o %HP5% 211 | if exist "%~dp0uvr5_weights\%hp5%" (echo download successful.) else (echo please try again! 212 | echo=) 213 | ) 214 | 215 | echo checking %hb% 216 | if exist "%~dp0%hb%" ( 217 | echo %hb% in .\pretrained checked. 218 | echo= 219 | ) else ( 220 | echo failed. starting download from huggingface. 221 | %~dp0%aria2%\aria2c --console-log-level=error -c -x 16 -s 16 -k 1M %dlhb% -d %~dp0 -o %hb% 222 | if exist "%~dp0%hb%" (echo download successful.) else (echo please try again! 223 | echo=) 224 | ) 225 | 226 | echo required files check finished. 227 | echo envfiles check complete. 228 | pause 229 | :end 230 | del flag.txt -------------------------------------------------------------------------------- /export_onnx.py: -------------------------------------------------------------------------------- 1 | from infer_pack.models_onnx import SynthesizerTrnMs256NSFsid 2 | import torch 3 | 4 | person = "Shiroha/shiroha.pth" 5 | exported_path = "model.onnx" 6 | 7 | 8 | 9 | cpt = torch.load(person, map_location="cpu") 10 | cpt["config"][-3]=cpt["weight"]["emb_g.weight"].shape[0]#n_spk 11 | print(*cpt["config"]) 12 | net_g = SynthesizerTrnMs256NSFsid(*cpt["config"], is_half=False) 13 | net_g.load_state_dict(cpt["weight"], strict=False) 14 | 15 | test_phone = torch.rand(1, 200, 256) 16 | test_phone_lengths = torch.tensor([200]).long() 17 | test_pitch = torch.randint(size=(1 ,200),low=5,high=255) 18 | test_pitchf = torch.rand(1, 200) 19 | test_ds = torch.LongTensor([0]) 20 | test_rnd = torch.rand(1, 192, 200) 21 | input_names = ["phone", "phone_lengths", "pitch", "pitchf", "ds", "rnd"] 22 | output_names = ["audio", ] 23 | device="cpu" 24 | torch.onnx.export(net_g, 25 | ( 26 | test_phone.to(device), 27 | test_phone_lengths.to(device), 28 | test_pitch.to(device), 29 | test_pitchf.to(device), 30 | test_ds.to(device), 31 | test_rnd.to(device) 32 | ), 33 | exported_path, 34 | dynamic_axes={ 35 | "phone": [1], 36 | "pitch": [1], 37 | "pitchf": [1], 38 | "rnd": [2], 39 | }, 40 | do_constant_folding=False, 41 | opset_version=16, 42 | verbose=False, 43 | input_names=input_names, 44 | output_names=output_names) -------------------------------------------------------------------------------- /extract_f0_print.py: -------------------------------------------------------------------------------- 1 | import os,traceback,sys,parselmouth 2 | import librosa 3 | import pyworld 4 | from scipy.io import wavfile 5 | import numpy as np,logging 6 | 
logging.getLogger('numba').setLevel(logging.WARNING) 7 | from multiprocessing import Process 8 | 9 | exp_dir = sys.argv[1] 10 | f = open("%s/extract_f0_feature.log"%exp_dir, "a+") 11 | def printt(strr): 12 | print(strr) 13 | f.write("%s\n" % strr) 14 | f.flush() 15 | 16 | n_p = int(sys.argv[2]) 17 | f0method = sys.argv[3] 18 | 19 | class FeatureInput(object): 20 | def __init__(self, samplerate=16000, hop_size=160): 21 | self.fs = samplerate 22 | self.hop = hop_size 23 | 24 | self.f0_bin = 256 25 | self.f0_max = 1100.0 26 | self.f0_min = 50.0 27 | self.f0_mel_min = 1127 * np.log(1 + self.f0_min / 700) 28 | self.f0_mel_max = 1127 * np.log(1 + self.f0_max / 700) 29 | 30 | def compute_f0(self, path,f0_method): 31 | x, sr = librosa.load(path, self.fs) 32 | p_len=x.shape[0]//self.hop 33 | assert sr == self.fs 34 | if(f0_method=="pm"): 35 | time_step = 160 / 16000 * 1000 36 | f0_min = 50 37 | f0_max = 1100 38 | f0 = parselmouth.Sound(x, sr).to_pitch_ac( 39 | time_step=time_step / 1000, voicing_threshold=0.6, 40 | pitch_floor=f0_min, pitch_ceiling=f0_max).selected_array['frequency'] 41 | pad_size=(p_len - len(f0) + 1) // 2 42 | if(pad_size>0 or p_len - len(f0) - pad_size>0): 43 | f0 = np.pad(f0,[[pad_size,p_len - len(f0) - pad_size]], mode='constant') 44 | elif(f0_method=="harvest"): 45 | f0, t = pyworld.harvest( 46 | x.astype(np.double), 47 | fs=sr, 48 | f0_ceil=1100, 49 | frame_period=1000 * self.hop / sr, 50 | ) 51 | f0 = pyworld.stonemask(x.astype(np.double), f0, t, self.fs) 52 | elif(f0_method=="dio"): 53 | f0, t = pyworld.dio( 54 | x.astype(np.double), 55 | fs=sr, 56 | f0_ceil=1100, 57 | frame_period=1000 * self.hop / sr, 58 | ) 59 | f0 = pyworld.stonemask(x.astype(np.double), f0, t, self.fs) 60 | return f0 61 | 62 | def coarse_f0(self, f0): 63 | f0_mel = 1127 * np.log(1 + f0 / 700) 64 | f0_mel[f0_mel > 0] = (f0_mel[f0_mel > 0] - self.f0_mel_min) * ( 65 | self.f0_bin - 2 66 | ) / (self.f0_mel_max - self.f0_mel_min) + 1 67 | 68 | # use 0 or 1 69 | f0_mel[f0_mel <= 1] = 1 70 | f0_mel[f0_mel > self.f0_bin - 1] = self.f0_bin - 1 71 | f0_coarse = np.rint(f0_mel).astype(np.int) 72 | assert f0_coarse.max() <= 255 and f0_coarse.min() >= 1, ( 73 | f0_coarse.max(), 74 | f0_coarse.min(), 75 | ) 76 | return f0_coarse 77 | 78 | def go(self,paths,f0_method): 79 | if (len(paths) == 0): printt("no-f0-todo") 80 | else: 81 | printt("todo-f0-%s"%len(paths)) 82 | n=max(len(paths)//5,1)#每个进程最多打印5条 83 | for idx,(inp_path,opt_path1,opt_path2) in enumerate(paths): 84 | try: 85 | if(idx%n==0):printt("f0ing,now-%s,all-%s,-%s"%(idx,len(paths),inp_path)) 86 | if(os.path.exists(opt_path1+".npy")==True and os.path.exists(opt_path2+".npy")==True):continue 87 | featur_pit = self.compute_f0(inp_path,f0_method) 88 | np.save(opt_path2,featur_pit,allow_pickle=False,)#nsf 89 | coarse_pit = self.coarse_f0(featur_pit) 90 | np.save(opt_path1,coarse_pit,allow_pickle=False,)#ori 91 | except: 92 | printt("f0fail-%s-%s-%s" % (idx, inp_path,traceback.format_exc())) 93 | 94 | if __name__=='__main__': 95 | # exp_dir=r"E:\codes\py39\dataset\mi-test" 96 | # n_p=16 97 | # f = open("%s/log_extract_f0.log"%exp_dir, "w") 98 | printt(sys.argv) 99 | featureInput = FeatureInput() 100 | paths=[] 101 | inp_root= "%s/1_16k_wavs"%(exp_dir) 102 | opt_root1="%s/2a_f0"%(exp_dir) 103 | opt_root2="%s/2b-f0nsf"%(exp_dir) 104 | 105 | os.makedirs(opt_root1,exist_ok=True) 106 | os.makedirs(opt_root2,exist_ok=True) 107 | for name in sorted(list(os.listdir(inp_root))): 108 | inp_path="%s/%s"%(inp_root,name) 109 | if ("spec" in inp_path): continue 110 | 
opt_path1="%s/%s"%(opt_root1,name) 111 | opt_path2="%s/%s"%(opt_root2,name) 112 | paths.append([inp_path,opt_path1,opt_path2]) 113 | 114 | ps=[] 115 | for i in range(n_p): 116 | p=Process(target=featureInput.go,args=(paths[i::n_p],f0method,)) 117 | p.start() 118 | ps.append(p) 119 | for p in ps: 120 | p.join() 121 | -------------------------------------------------------------------------------- /extract_feature_print.py: -------------------------------------------------------------------------------- 1 | import os,sys,traceback 2 | if len(sys.argv) == 4: 3 | n_part=int(sys.argv[1]) 4 | i_part=int(sys.argv[2]) 5 | exp_dir=sys.argv[3] 6 | else: 7 | n_part=int(sys.argv[1]) 8 | i_part=int(sys.argv[2]) 9 | i_gpu=sys.argv[3] 10 | exp_dir=sys.argv[4] 11 | os.environ["CUDA_VISIBLE_DEVICES"]=str(i_gpu) 12 | 13 | import torch 14 | import torch.nn.functional as F 15 | import soundfile as sf 16 | import numpy as np 17 | from fairseq import checkpoint_utils 18 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 19 | 20 | f = open("%s/extract_f0_feature.log"%exp_dir, "a+") 21 | def printt(strr): 22 | print(strr) 23 | f.write("%s\n" % strr) 24 | f.flush() 25 | printt(sys.argv) 26 | model_path = "hubert_base.pt" 27 | 28 | printt(exp_dir) 29 | wavPath = "%s/1_16k_wavs"%exp_dir 30 | outPath = "%s/3_feature256"%exp_dir 31 | os.makedirs(outPath,exist_ok=True) 32 | # wave must be 16k, hop_size=320 33 | def readwave(wav_path, normalize=False): 34 | wav, sr = sf.read(wav_path) 35 | assert sr == 16000 36 | feats = torch.from_numpy(wav).float() 37 | if feats.dim() == 2: # double channels 38 | feats = feats.mean(-1) 39 | assert feats.dim() == 1, feats.dim() 40 | if normalize: 41 | with torch.no_grad(): 42 | feats = F.layer_norm(feats, feats.shape) 43 | feats = feats.view(1, -1) 44 | return feats 45 | # HuBERT model 46 | printt("load model(s) from {}".format(model_path)) 47 | models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task( 48 | [model_path], 49 | suffix="", 50 | ) 51 | model = models[0] 52 | model = model.to(device) 53 | if torch.cuda.is_available(): 54 | model = model.half() 55 | model.eval() 56 | 57 | todo=sorted(list(os.listdir(wavPath)))[i_part::n_part] 58 | n = max(1,len(todo) // 10) # 最多打印十条 59 | if(len(todo)==0):printt("no-feature-todo") 60 | else: 61 | printt("all-feature-%s"%len(todo)) 62 | for idx,file in enumerate(todo): 63 | try: 64 | if file.endswith(".wav"): 65 | wav_path = "%s/%s"%(wavPath,file) 66 | out_path = "%s/%s"%(outPath,file.replace("wav","npy")) 67 | 68 | if(os.path.exists(out_path)):continue 69 | 70 | feats = readwave(wav_path, normalize=saved_cfg.task.normalize) 71 | padding_mask = torch.BoolTensor(feats.shape).fill_(False) 72 | inputs = { 73 | "source": feats.half().to(device) if torch.cuda.is_available() else feats.to(device), 74 | "padding_mask": padding_mask.to(device), 75 | "output_layer": 9, # layer 9 76 | } 77 | with torch.no_grad(): 78 | logits = model.extract_features(**inputs) 79 | feats = model.final_proj(logits[0]) 80 | 81 | feats = feats.squeeze(0).float().cpu().numpy() 82 | if(np.isnan(feats).sum()==0): 83 | np.save(out_path, feats, allow_pickle=False) 84 | else: 85 | printt("%s-contains nan"%file) 86 | if (idx % n == 0):printt("now-%s,all-%s,%s,%s"%(len(todo),idx,file,feats.shape)) 87 | except: 88 | printt(traceback.format_exc()) 89 | printt("all-feature-done") 90 | -------------------------------------------------------------------------------- /go-web_jp.bat: 
-------------------------------------------------------------------------------- 1 | runtime\python.exe infer-web_jp.py -------------------------------------------------------------------------------- /infer/infer-pm-index256.py: -------------------------------------------------------------------------------- 1 | ''' 2 | 3 | 对源特征进行检索 4 | ''' 5 | import torch, pdb, os,parselmouth 6 | os.environ["CUDA_VISIBLE_DEVICES"]="0" 7 | import numpy as np 8 | import soundfile as sf 9 | # from models import SynthesizerTrn256#hifigan_nonsf 10 | # from infer_pack.models import SynthesizerTrn256NSF as SynthesizerTrn256#hifigan_nsf 11 | from infer_pack.models import SynthesizerTrnMs256NSFsid as SynthesizerTrn256#hifigan_nsf 12 | # from infer_pack.models import SynthesizerTrnMs256NSFsid_sim as SynthesizerTrn256#hifigan_nsf 13 | # from models import SynthesizerTrn256NSFsim as SynthesizerTrn256#hifigan_nsf 14 | # from models import SynthesizerTrn256NSFsimFlow as SynthesizerTrn256#hifigan_nsf 15 | 16 | 17 | from scipy.io import wavfile 18 | from fairseq import checkpoint_utils 19 | # import pyworld 20 | import librosa 21 | import torch.nn.functional as F 22 | import scipy.signal as signal 23 | # import torchcrepe 24 | from time import time as ttime 25 | 26 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 27 | model_path = r"E:\codes\py39\vits_vc_gpu_train\hubert_base.pt"# 28 | print("load model(s) from {}".format(model_path)) 29 | models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task( 30 | [model_path], 31 | suffix="", 32 | ) 33 | model = models[0] 34 | model = model.to(device) 35 | model = model.half() 36 | model.eval() 37 | 38 | # net_g = SynthesizerTrn256(1025,32,192,192,768,2,6,3,0.1,"1", [3,7,11],[[1,3,5], [1,3,5], [1,3,5]],[10,10,2,2],512,[16,16,4,4],183,256,is_half=True)#hifigan#512#256 39 | # net_g = SynthesizerTrn256(1025,32,192,192,768,2,6,3,0.1,"1", [3,7,11],[[1,3,5], [1,3,5], [1,3,5]],[10,10,2,2],512,[16,16,4,4],109,256,is_half=True)#hifigan#512#256 40 | net_g = SynthesizerTrn256(1025,32,192,192,768,2,6,3,0,"1", [3,7,11],[[1,3,5], [1,3,5], [1,3,5]],[10,10,2,2],512,[16,16,4,4],183,256,is_half=True)#hifigan#512#256#no_dropout 41 | # net_g = SynthesizerTrn256(1025,32,192,192,768,2,3,3,0.1,"1", [3,7,11],[[1,3,5], [1,3,5], [1,3,5]],[10,10,2,2],512,[16,16,4,4],0)#ts3 42 | # net_g = SynthesizerTrn256(1025,32,192,192,768,2,6,3,0.1,"1", [3,7,11],[[1,3,5], [1,3,5], [1,3,5]],[10,10,2],512,[16,16,4],0)#hifigan-ps-sr 43 | # 44 | # net_g = SynthesizerTrn(1025, 32, 192, 192, 768, 2, 6, 3, 0.1, "1", [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]], [5,5], 512, [15,15], 0)#ms 45 | # net_g = SynthesizerTrn(1025, 32, 192, 192, 768, 2, 6, 3, 0.1, "1", [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]], [10,10], 512, [16,16], 0)#idwt2 46 | 47 | # weights=torch.load("infer/ft-mi_1k-noD.pt") 48 | # weights=torch.load("infer/ft-mi-freeze-vocoder-flow-enc_q_1k.pt") 49 | # weights=torch.load("infer/ft-mi-freeze-vocoder_true_1k.pt") 50 | # weights=torch.load("infer/ft-mi-sim1k.pt") 51 | weights=torch.load("infer/ft-mi-no_opt-no_dropout.pt") 52 | print(net_g.load_state_dict(weights,strict=True)) 53 | 54 | net_g.eval().to(device) 55 | net_g.half() 56 | def get_f0(x, p_len,f0_up_key=0): 57 | 58 | time_step = 160 / 16000 * 1000 59 | f0_min = 50 60 | f0_max = 1100 61 | f0_mel_min = 1127 * np.log(1 + f0_min / 700) 62 | f0_mel_max = 1127 * np.log(1 + f0_max / 700) 63 | 64 | f0 = parselmouth.Sound(x, 16000).to_pitch_ac( 65 | time_step=time_step / 1000, voicing_threshold=0.6, 66 | pitch_floor=f0_min, 
pitch_ceiling=f0_max).selected_array['frequency'] 67 | 68 | pad_size=(p_len - len(f0) + 1) // 2 69 | if(pad_size>0 or p_len - len(f0) - pad_size>0): 70 | f0 = np.pad(f0,[[pad_size,p_len - len(f0) - pad_size]], mode='constant') 71 | f0 *= pow(2, f0_up_key / 12) 72 | f0bak = f0.copy() 73 | 74 | f0_mel = 1127 * np.log(1 + f0 / 700) 75 | f0_mel[f0_mel > 0] = (f0_mel[f0_mel > 0] - f0_mel_min) * 254 / (f0_mel_max - f0_mel_min) + 1 76 | f0_mel[f0_mel <= 1] = 1 77 | f0_mel[f0_mel > 255] = 255 78 | # f0_mel[f0_mel > 188] = 188 79 | f0_coarse = np.rint(f0_mel).astype(np.int) 80 | return f0_coarse, f0bak 81 | 82 | import faiss 83 | index=faiss.read_index("infer/added_IVF512_Flat_mi_baseline_src_feat.index") 84 | big_npy=np.load("infer/big_src_feature_mi.npy") 85 | ta0=ta1=ta2=0 86 | for idx,name in enumerate(["冬之花clip1.wav",]):## 87 | wav_path = "todo-songs/%s" % name# 88 | f0_up_key=-2# 89 | audio, sampling_rate = sf.read(wav_path) 90 | if len(audio.shape) > 1: 91 | audio = librosa.to_mono(audio.transpose(1, 0)) 92 | if sampling_rate != 16000: 93 | audio = librosa.resample(audio, orig_sr=sampling_rate, target_sr=16000) 94 | 95 | 96 | feats = torch.from_numpy(audio).float() 97 | if feats.dim() == 2: # double channels 98 | feats = feats.mean(-1) 99 | assert feats.dim() == 1, feats.dim() 100 | feats = feats.view(1, -1) 101 | padding_mask = torch.BoolTensor(feats.shape).fill_(False) 102 | inputs = { 103 | "source": feats.half().to(device), 104 | "padding_mask": padding_mask.to(device), 105 | "output_layer": 9, # layer 9 106 | } 107 | if torch.cuda.is_available(): torch.cuda.synchronize() 108 | t0=ttime() 109 | with torch.no_grad(): 110 | logits = model.extract_features(**inputs) 111 | feats = model.final_proj(logits[0]) 112 | 113 | ####索引优化 114 | npy = feats[0].cpu().numpy().astype("float32") 115 | D, I = index.search(npy, 1) 116 | feats = torch.from_numpy(big_npy[I.squeeze()].astype("float16")).unsqueeze(0).to(device) 117 | 118 | feats=F.interpolate(feats.permute(0,2,1),scale_factor=2).permute(0,2,1) 119 | if torch.cuda.is_available(): torch.cuda.synchronize() 120 | t1=ttime() 121 | # p_len = min(feats.shape[1],10000,pitch.shape[0])#太大了爆显存 122 | p_len = min(feats.shape[1],10000)# 123 | pitch, pitchf = get_f0(audio, p_len,f0_up_key) 124 | p_len = min(feats.shape[1],10000,pitch.shape[0])#太大了爆显存 125 | if torch.cuda.is_available(): torch.cuda.synchronize() 126 | t2=ttime() 127 | feats = feats[:,:p_len, :] 128 | pitch = pitch[:p_len] 129 | pitchf = pitchf[:p_len] 130 | p_len = torch.LongTensor([p_len]).to(device) 131 | pitch = torch.LongTensor(pitch).unsqueeze(0).to(device) 132 | sid=torch.LongTensor([0]).to(device) 133 | pitchf = torch.FloatTensor(pitchf).unsqueeze(0).to(device) 134 | with torch.no_grad(): 135 | audio = net_g.infer(feats, p_len,pitch,pitchf,sid)[0][0, 0].data.cpu().float().numpy()#nsf 136 | if torch.cuda.is_available(): torch.cuda.synchronize() 137 | t3=ttime() 138 | ta0+=(t1-t0) 139 | ta1+=(t2-t1) 140 | ta2+=(t3-t2) 141 | # wavfile.write("ft-mi_1k-index256-noD-%s.wav"%name, 40000, audio)## 142 | # wavfile.write("ft-mi-freeze-vocoder-flow-enc_q_1k-%s.wav"%name, 40000, audio)## 143 | # wavfile.write("ft-mi-sim1k-%s.wav"%name, 40000, audio)## 144 | wavfile.write("ft-mi-no_opt-no_dropout-%s.wav"%name, 40000, audio)## 145 | 146 | 147 | print(ta0,ta1,ta2)# 148 | -------------------------------------------------------------------------------- /infer/train-index.py: -------------------------------------------------------------------------------- 1 | ''' 2 | 
格式:直接cid为自带的index位;aid放不下了,通过字典来查,反正就5w个 3 | ''' 4 | import faiss,numpy as np,os 5 | 6 | # ###########如果是原始特征要先写save 7 | inp_root=r"E:\codes\py39\dataset\mi\2-co256" 8 | npys=[] 9 | for name in sorted(list(os.listdir(inp_root))): 10 | phone=np.load("%s/%s"%(inp_root,name)) 11 | npys.append(phone) 12 | big_npy=np.concatenate(npys,0) 13 | print(big_npy.shape)#(6196072, 192)#fp32#4.43G 14 | np.save("infer/big_src_feature_mi.npy",big_npy) 15 | 16 | ##################train+add 17 | # big_npy=np.load("/bili-coeus/jupyter/jupyterhub-liujing04/vits_ch/inference_f0/big_src_feature_mi.npy") 18 | print(big_npy.shape) 19 | index = faiss.index_factory(256, "IVF512,Flat")#mi 20 | print("training") 21 | index_ivf = faiss.extract_index_ivf(index)# 22 | index_ivf.nprobe = 9 23 | index.train(big_npy) 24 | faiss.write_index(index, 'infer/trained_IVF512_Flat_mi_baseline_src_feat.index') 25 | print("adding") 26 | index.add(big_npy) 27 | faiss.write_index(index,"infer/added_IVF512_Flat_mi_baseline_src_feat.index") 28 | ''' 29 | 大小(都是FP32) 30 | big_src_feature 2.95G 31 | (3098036, 256) 32 | big_emb 4.43G 33 | (6196072, 192) 34 | big_emb双倍是因为求特征要repeat后再加pitch 35 | 36 | ''' -------------------------------------------------------------------------------- /infer/trans_weights.py: -------------------------------------------------------------------------------- 1 | import torch,pdb 2 | 3 | # a=torch.load(r"E:\codes\py39\vits_vc_gpu_train\logs\ft-mi-suc\G_1000.pth")["model"]#sim_nsf# 4 | # a=torch.load(r"E:\codes\py39\vits_vc_gpu_train\logs\ft-mi-freeze-vocoder-flow-enc_q\G_1000.pth")["model"]#sim_nsf# 5 | # a=torch.load(r"E:\codes\py39\vits_vc_gpu_train\logs\ft-mi-freeze-vocoder\G_1000.pth")["model"]#sim_nsf# 6 | # a=torch.load(r"E:\codes\py39\vits_vc_gpu_train\logs\ft-mi-test\G_1000.pth")["model"]#sim_nsf# 7 | a=torch.load(r"E:\codes\py39\vits_vc_gpu_train\logs\ft-mi-no_opt-no_dropout\G_1000.pth")["model"]#sim_nsf# 8 | for key in a.keys():a[key]=a[key].half() 9 | # torch.save(a,"ft-mi-freeze-vocoder_true_1k.pt")# 10 | # torch.save(a,"ft-mi-sim1k.pt")# 11 | torch.save(a,"ft-mi-no_opt-no_dropout.pt")# 12 | -------------------------------------------------------------------------------- /infer_pack/commons.py: -------------------------------------------------------------------------------- 1 | import math 2 | import numpy as np 3 | import torch 4 | from torch import nn 5 | from torch.nn import functional as F 6 | 7 | 8 | def init_weights(m, mean=0.0, std=0.01): 9 | classname = m.__class__.__name__ 10 | if classname.find("Conv") != -1: 11 | m.weight.data.normal_(mean, std) 12 | 13 | 14 | def get_padding(kernel_size, dilation=1): 15 | return int((kernel_size * dilation - dilation) / 2) 16 | 17 | 18 | def convert_pad_shape(pad_shape): 19 | l = pad_shape[::-1] 20 | pad_shape = [item for sublist in l for item in sublist] 21 | return pad_shape 22 | 23 | 24 | def kl_divergence(m_p, logs_p, m_q, logs_q): 25 | """KL(P||Q)""" 26 | kl = (logs_q - logs_p) - 0.5 27 | kl += ( 28 | 0.5 * (torch.exp(2.0 * logs_p) + ((m_p - m_q) ** 2)) * torch.exp(-2.0 * logs_q) 29 | ) 30 | return kl 31 | 32 | 33 | def rand_gumbel(shape): 34 | """Sample from the Gumbel distribution, protect from overflows.""" 35 | uniform_samples = torch.rand(shape) * 0.99998 + 0.00001 36 | return -torch.log(-torch.log(uniform_samples)) 37 | 38 | 39 | def rand_gumbel_like(x): 40 | g = rand_gumbel(x.size()).to(dtype=x.dtype, device=x.device) 41 | return g 42 | 43 | 44 | def slice_segments(x, ids_str, segment_size=4): 45 | ret = torch.zeros_like(x[:, :, :segment_size]) 
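    # For each element in the batch, copy a window of length segment_size starting at
    # ids_str[i] along the time axis; rand_slice_segments below calls this with random
    # start indices to draw training segments.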
46 | for i in range(x.size(0)): 47 | idx_str = ids_str[i] 48 | idx_end = idx_str + segment_size 49 | ret[i] = x[i, :, idx_str:idx_end] 50 | return ret 51 | def slice_segments2(x, ids_str, segment_size=4): 52 | ret = torch.zeros_like(x[:, :segment_size]) 53 | for i in range(x.size(0)): 54 | idx_str = ids_str[i] 55 | idx_end = idx_str + segment_size 56 | ret[i] = x[i, idx_str:idx_end] 57 | return ret 58 | 59 | 60 | def rand_slice_segments(x, x_lengths=None, segment_size=4): 61 | b, d, t = x.size() 62 | if x_lengths is None: 63 | x_lengths = t 64 | ids_str_max = x_lengths - segment_size + 1 65 | ids_str = (torch.rand([b]).to(device=x.device) * ids_str_max).to(dtype=torch.long) 66 | ret = slice_segments(x, ids_str, segment_size) 67 | return ret, ids_str 68 | 69 | 70 | def get_timing_signal_1d(length, channels, min_timescale=1.0, max_timescale=1.0e4): 71 | position = torch.arange(length, dtype=torch.float) 72 | num_timescales = channels // 2 73 | log_timescale_increment = math.log(float(max_timescale) / float(min_timescale)) / ( 74 | num_timescales - 1 75 | ) 76 | inv_timescales = min_timescale * torch.exp( 77 | torch.arange(num_timescales, dtype=torch.float) * -log_timescale_increment 78 | ) 79 | scaled_time = position.unsqueeze(0) * inv_timescales.unsqueeze(1) 80 | signal = torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], 0) 81 | signal = F.pad(signal, [0, 0, 0, channels % 2]) 82 | signal = signal.view(1, channels, length) 83 | return signal 84 | 85 | 86 | def add_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4): 87 | b, channels, length = x.size() 88 | signal = get_timing_signal_1d(length, channels, min_timescale, max_timescale) 89 | return x + signal.to(dtype=x.dtype, device=x.device) 90 | 91 | 92 | def cat_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4, axis=1): 93 | b, channels, length = x.size() 94 | signal = get_timing_signal_1d(length, channels, min_timescale, max_timescale) 95 | return torch.cat([x, signal.to(dtype=x.dtype, device=x.device)], axis) 96 | 97 | 98 | def subsequent_mask(length): 99 | mask = torch.tril(torch.ones(length, length)).unsqueeze(0).unsqueeze(0) 100 | return mask 101 | 102 | 103 | @torch.jit.script 104 | def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels): 105 | n_channels_int = n_channels[0] 106 | in_act = input_a + input_b 107 | t_act = torch.tanh(in_act[:, :n_channels_int, :]) 108 | s_act = torch.sigmoid(in_act[:, n_channels_int:, :]) 109 | acts = t_act * s_act 110 | return acts 111 | 112 | 113 | def convert_pad_shape(pad_shape): 114 | l = pad_shape[::-1] 115 | pad_shape = [item for sublist in l for item in sublist] 116 | return pad_shape 117 | 118 | 119 | def shift_1d(x): 120 | x = F.pad(x, convert_pad_shape([[0, 0], [0, 0], [1, 0]]))[:, :, :-1] 121 | return x 122 | 123 | 124 | def sequence_mask(length, max_length=None): 125 | if max_length is None: 126 | max_length = length.max() 127 | x = torch.arange(max_length, dtype=length.dtype, device=length.device) 128 | return x.unsqueeze(0) < length.unsqueeze(1) 129 | 130 | 131 | def generate_path(duration, mask): 132 | """ 133 | duration: [b, 1, t_x] 134 | mask: [b, 1, t_y, t_x] 135 | """ 136 | device = duration.device 137 | 138 | b, _, t_y, t_x = mask.shape 139 | cum_duration = torch.cumsum(duration, -1) 140 | 141 | cum_duration_flat = cum_duration.view(b * t_x) 142 | path = sequence_mask(cum_duration_flat, t_y).to(mask.dtype) 143 | path = path.view(b, t_x, t_y) 144 | path = path - F.pad(path, convert_pad_shape([[0, 0], [1, 0], [0, 0]]))[:, :-1] 145 | path = 
path.unsqueeze(1).transpose(2, 3) * mask 146 | return path 147 | 148 | 149 | def clip_grad_value_(parameters, clip_value, norm_type=2): 150 | if isinstance(parameters, torch.Tensor): 151 | parameters = [parameters] 152 | parameters = list(filter(lambda p: p.grad is not None, parameters)) 153 | norm_type = float(norm_type) 154 | if clip_value is not None: 155 | clip_value = float(clip_value) 156 | 157 | total_norm = 0 158 | for p in parameters: 159 | param_norm = p.grad.data.norm(norm_type) 160 | total_norm += param_norm.item() ** norm_type 161 | if clip_value is not None: 162 | p.grad.data.clamp_(min=-clip_value, max=clip_value) 163 | total_norm = total_norm ** (1.0 / norm_type) 164 | return total_norm 165 | -------------------------------------------------------------------------------- /infer_pack/transforms.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.nn import functional as F 3 | 4 | import numpy as np 5 | 6 | 7 | DEFAULT_MIN_BIN_WIDTH = 1e-3 8 | DEFAULT_MIN_BIN_HEIGHT = 1e-3 9 | DEFAULT_MIN_DERIVATIVE = 1e-3 10 | 11 | 12 | def piecewise_rational_quadratic_transform(inputs, 13 | unnormalized_widths, 14 | unnormalized_heights, 15 | unnormalized_derivatives, 16 | inverse=False, 17 | tails=None, 18 | tail_bound=1., 19 | min_bin_width=DEFAULT_MIN_BIN_WIDTH, 20 | min_bin_height=DEFAULT_MIN_BIN_HEIGHT, 21 | min_derivative=DEFAULT_MIN_DERIVATIVE): 22 | 23 | if tails is None: 24 | spline_fn = rational_quadratic_spline 25 | spline_kwargs = {} 26 | else: 27 | spline_fn = unconstrained_rational_quadratic_spline 28 | spline_kwargs = { 29 | 'tails': tails, 30 | 'tail_bound': tail_bound 31 | } 32 | 33 | outputs, logabsdet = spline_fn( 34 | inputs=inputs, 35 | unnormalized_widths=unnormalized_widths, 36 | unnormalized_heights=unnormalized_heights, 37 | unnormalized_derivatives=unnormalized_derivatives, 38 | inverse=inverse, 39 | min_bin_width=min_bin_width, 40 | min_bin_height=min_bin_height, 41 | min_derivative=min_derivative, 42 | **spline_kwargs 43 | ) 44 | return outputs, logabsdet 45 | 46 | 47 | def searchsorted(bin_locations, inputs, eps=1e-6): 48 | bin_locations[..., -1] += eps 49 | return torch.sum( 50 | inputs[..., None] >= bin_locations, 51 | dim=-1 52 | ) - 1 53 | 54 | 55 | def unconstrained_rational_quadratic_spline(inputs, 56 | unnormalized_widths, 57 | unnormalized_heights, 58 | unnormalized_derivatives, 59 | inverse=False, 60 | tails='linear', 61 | tail_bound=1., 62 | min_bin_width=DEFAULT_MIN_BIN_WIDTH, 63 | min_bin_height=DEFAULT_MIN_BIN_HEIGHT, 64 | min_derivative=DEFAULT_MIN_DERIVATIVE): 65 | inside_interval_mask = (inputs >= -tail_bound) & (inputs <= tail_bound) 66 | outside_interval_mask = ~inside_interval_mask 67 | 68 | outputs = torch.zeros_like(inputs) 69 | logabsdet = torch.zeros_like(inputs) 70 | 71 | if tails == 'linear': 72 | unnormalized_derivatives = F.pad(unnormalized_derivatives, pad=(1, 1)) 73 | constant = np.log(np.exp(1 - min_derivative) - 1) 74 | unnormalized_derivatives[..., 0] = constant 75 | unnormalized_derivatives[..., -1] = constant 76 | 77 | outputs[outside_interval_mask] = inputs[outside_interval_mask] 78 | logabsdet[outside_interval_mask] = 0 79 | else: 80 | raise RuntimeError('{} tails are not implemented.'.format(tails)) 81 | 82 | outputs[inside_interval_mask], logabsdet[inside_interval_mask] = rational_quadratic_spline( 83 | inputs=inputs[inside_interval_mask], 84 | unnormalized_widths=unnormalized_widths[inside_interval_mask, :], 85 | 
unnormalized_heights=unnormalized_heights[inside_interval_mask, :], 86 | unnormalized_derivatives=unnormalized_derivatives[inside_interval_mask, :], 87 | inverse=inverse, 88 | left=-tail_bound, right=tail_bound, bottom=-tail_bound, top=tail_bound, 89 | min_bin_width=min_bin_width, 90 | min_bin_height=min_bin_height, 91 | min_derivative=min_derivative 92 | ) 93 | 94 | return outputs, logabsdet 95 | 96 | def rational_quadratic_spline(inputs, 97 | unnormalized_widths, 98 | unnormalized_heights, 99 | unnormalized_derivatives, 100 | inverse=False, 101 | left=0., right=1., bottom=0., top=1., 102 | min_bin_width=DEFAULT_MIN_BIN_WIDTH, 103 | min_bin_height=DEFAULT_MIN_BIN_HEIGHT, 104 | min_derivative=DEFAULT_MIN_DERIVATIVE): 105 | if torch.min(inputs) < left or torch.max(inputs) > right: 106 | raise ValueError('Input to a transform is not within its domain') 107 | 108 | num_bins = unnormalized_widths.shape[-1] 109 | 110 | if min_bin_width * num_bins > 1.0: 111 | raise ValueError('Minimal bin width too large for the number of bins') 112 | if min_bin_height * num_bins > 1.0: 113 | raise ValueError('Minimal bin height too large for the number of bins') 114 | 115 | widths = F.softmax(unnormalized_widths, dim=-1) 116 | widths = min_bin_width + (1 - min_bin_width * num_bins) * widths 117 | cumwidths = torch.cumsum(widths, dim=-1) 118 | cumwidths = F.pad(cumwidths, pad=(1, 0), mode='constant', value=0.0) 119 | cumwidths = (right - left) * cumwidths + left 120 | cumwidths[..., 0] = left 121 | cumwidths[..., -1] = right 122 | widths = cumwidths[..., 1:] - cumwidths[..., :-1] 123 | 124 | derivatives = min_derivative + F.softplus(unnormalized_derivatives) 125 | 126 | heights = F.softmax(unnormalized_heights, dim=-1) 127 | heights = min_bin_height + (1 - min_bin_height * num_bins) * heights 128 | cumheights = torch.cumsum(heights, dim=-1) 129 | cumheights = F.pad(cumheights, pad=(1, 0), mode='constant', value=0.0) 130 | cumheights = (top - bottom) * cumheights + bottom 131 | cumheights[..., 0] = bottom 132 | cumheights[..., -1] = top 133 | heights = cumheights[..., 1:] - cumheights[..., :-1] 134 | 135 | if inverse: 136 | bin_idx = searchsorted(cumheights, inputs)[..., None] 137 | else: 138 | bin_idx = searchsorted(cumwidths, inputs)[..., None] 139 | 140 | input_cumwidths = cumwidths.gather(-1, bin_idx)[..., 0] 141 | input_bin_widths = widths.gather(-1, bin_idx)[..., 0] 142 | 143 | input_cumheights = cumheights.gather(-1, bin_idx)[..., 0] 144 | delta = heights / widths 145 | input_delta = delta.gather(-1, bin_idx)[..., 0] 146 | 147 | input_derivatives = derivatives.gather(-1, bin_idx)[..., 0] 148 | input_derivatives_plus_one = derivatives[..., 1:].gather(-1, bin_idx)[..., 0] 149 | 150 | input_heights = heights.gather(-1, bin_idx)[..., 0] 151 | 152 | if inverse: 153 | a = (((inputs - input_cumheights) * (input_derivatives 154 | + input_derivatives_plus_one 155 | - 2 * input_delta) 156 | + input_heights * (input_delta - input_derivatives))) 157 | b = (input_heights * input_derivatives 158 | - (inputs - input_cumheights) * (input_derivatives 159 | + input_derivatives_plus_one 160 | - 2 * input_delta)) 161 | c = - input_delta * (inputs - input_cumheights) 162 | 163 | discriminant = b.pow(2) - 4 * a * c 164 | assert (discriminant >= 0).all() 165 | 166 | root = (2 * c) / (-b - torch.sqrt(discriminant)) 167 | outputs = root * input_bin_widths + input_cumwidths 168 | 169 | theta_one_minus_theta = root * (1 - root) 170 | denominator = input_delta + ((input_derivatives + input_derivatives_plus_one - 2 * input_delta) 171 
| * theta_one_minus_theta) 172 | derivative_numerator = input_delta.pow(2) * (input_derivatives_plus_one * root.pow(2) 173 | + 2 * input_delta * theta_one_minus_theta 174 | + input_derivatives * (1 - root).pow(2)) 175 | logabsdet = torch.log(derivative_numerator) - 2 * torch.log(denominator) 176 | 177 | return outputs, -logabsdet 178 | else: 179 | theta = (inputs - input_cumwidths) / input_bin_widths 180 | theta_one_minus_theta = theta * (1 - theta) 181 | 182 | numerator = input_heights * (input_delta * theta.pow(2) 183 | + input_derivatives * theta_one_minus_theta) 184 | denominator = input_delta + ((input_derivatives + input_derivatives_plus_one - 2 * input_delta) 185 | * theta_one_minus_theta) 186 | outputs = input_cumheights + numerator / denominator 187 | 188 | derivative_numerator = input_delta.pow(2) * (input_derivatives_plus_one * theta.pow(2) 189 | + 2 * input_delta * theta_one_minus_theta 190 | + input_derivatives * (1 - theta).pow(2)) 191 | logabsdet = torch.log(derivative_numerator) - 2 * torch.log(denominator) 192 | 193 | return outputs, logabsdet 194 | -------------------------------------------------------------------------------- /infer_uvr5.py: -------------------------------------------------------------------------------- 1 | import os,sys,torch,warnings,pdb 2 | warnings.filterwarnings("ignore") 3 | import librosa 4 | import importlib 5 | import numpy as np 6 | import hashlib , math 7 | from tqdm import tqdm 8 | from uvr5_pack.lib_v5 import spec_utils 9 | from uvr5_pack.utils import _get_name_params,inference 10 | from uvr5_pack.lib_v5.model_param_init import ModelParameters 11 | from scipy.io import wavfile 12 | 13 | class _audio_pre_(): 14 | def __init__(self, model_path,device,is_half): 15 | self.model_path = model_path 16 | self.device = device 17 | self.data = { 18 | # Processing Options 19 | 'postprocess': False, 20 | 'tta': False, 21 | # Constants 22 | 'window_size': 512, 23 | 'agg': 10, 24 | 'high_end_process': 'mirroring', 25 | } 26 | nn_arch_sizes = [ 27 | 31191, # default 28 | 33966,61968, 123821, 123812, 537238 # custom 29 | ] 30 | self.nn_architecture = list('{}KB'.format(s) for s in nn_arch_sizes) 31 | model_size = math.ceil(os.stat(model_path ).st_size / 1024) 32 | nn_architecture = '{}KB'.format(min(nn_arch_sizes, key=lambda x:abs(x-model_size))) 33 | nets = importlib.import_module('uvr5_pack.lib_v5.nets' + f'_{nn_architecture}'.replace('_{}KB'.format(nn_arch_sizes[0]), ''), package=None) 34 | model_hash = hashlib.md5(open(model_path,'rb').read()).hexdigest() 35 | param_name ,model_params_d = _get_name_params(model_path , model_hash) 36 | 37 | mp = ModelParameters(model_params_d) 38 | model = nets.CascadedASPPNet(mp.param['bins'] * 2) 39 | cpk = torch.load( model_path , map_location='cpu') 40 | model.load_state_dict(cpk) 41 | model.eval() 42 | if(is_half==True):model = model.half().to(device) 43 | else:model = model.to(device) 44 | 45 | self.mp = mp 46 | self.model = model 47 | 48 | def _path_audio_(self, music_file ,ins_root=None,vocal_root=None): 49 | if(ins_root is None and vocal_root is None):return "No save root." 
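# Editor's note: the snippet below is a minimal, hypothetical round-trip check for
# rational_quadratic_spline from infer_pack/transforms.py above; it is not part of the repo.
# Tensor sizes are illustrative, and inputs must lie in the [left, right] = [0, 1] domain
# that the function itself enforces.
import torch
from infer_pack.transforms import rational_quadratic_spline

torch.manual_seed(0)
n, num_bins = 4, 10
x = torch.rand(n)                    # points inside [0, 1]
w = torch.randn(n, num_bins)         # unnormalized bin widths
h = torch.randn(n, num_bins)         # unnormalized bin heights
d = torch.randn(n, num_bins + 1)     # one unnormalized derivative per bin edge
y, logdet = rational_quadratic_spline(x, w, h, d, inverse=False)
x_rec, logdet_inv = rational_quadratic_spline(y, w, h, d, inverse=True)
assert torch.allclose(x, x_rec, atol=1e-4)            # the spline inverts its own output
assert torch.allclose(logdet, -logdet_inv, atol=1e-4) # log-determinants negate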
50 | name=os.path.basename(music_file) 51 | if(ins_root is not None):os.makedirs(ins_root, exist_ok=True) 52 | if(vocal_root is not None):os.makedirs(vocal_root , exist_ok=True) 53 | X_wave, y_wave, X_spec_s, y_spec_s = {}, {}, {}, {} 54 | bands_n = len(self.mp.param['band']) 55 | # print(bands_n) 56 | for d in range(bands_n, 0, -1): 57 | bp = self.mp.param['band'][d] 58 | if d == bands_n: # high-end band 59 | X_wave[d], _ = librosa.core.load(#理论上librosa读取可能对某些音频有bug,应该上ffmpeg读取,但是太麻烦了弃坑 60 | music_file, bp['sr'], False, dtype=np.float32, res_type=bp['res_type']) 61 | if X_wave[d].ndim == 1: 62 | X_wave[d] = np.asfortranarray([X_wave[d], X_wave[d]]) 63 | else: # lower bands 64 | X_wave[d] = librosa.core.resample(X_wave[d+1], self.mp.param['band'][d+1]['sr'], bp['sr'], res_type=bp['res_type']) 65 | # Stft of wave source 66 | X_spec_s[d] = spec_utils.wave_to_spectrogram_mt(X_wave[d], bp['hl'], bp['n_fft'], self.mp.param['mid_side'], self.mp.param['mid_side_b2'], self.mp.param['reverse']) 67 | # pdb.set_trace() 68 | if d == bands_n and self.data['high_end_process'] != 'none': 69 | input_high_end_h = (bp['n_fft']//2 - bp['crop_stop']) + ( self.mp.param['pre_filter_stop'] - self.mp.param['pre_filter_start']) 70 | input_high_end = X_spec_s[d][:, bp['n_fft']//2-input_high_end_h:bp['n_fft']//2, :] 71 | 72 | X_spec_m = spec_utils.combine_spectrograms(X_spec_s, self.mp) 73 | aggresive_set = float(self.data['agg']/100) 74 | aggressiveness = {'value': aggresive_set, 'split_bin': self.mp.param['band'][1]['crop_stop']} 75 | with torch.no_grad(): 76 | pred, X_mag, X_phase = inference(X_spec_m,self.device,self.model, aggressiveness,self.data) 77 | # Postprocess 78 | if self.data['postprocess']: 79 | pred_inv = np.clip(X_mag - pred, 0, np.inf) 80 | pred = spec_utils.mask_silence(pred, pred_inv) 81 | y_spec_m = pred * X_phase 82 | v_spec_m = X_spec_m - y_spec_m 83 | 84 | if (ins_root is not None): 85 | if self.data['high_end_process'].startswith('mirroring'): 86 | input_high_end_ = spec_utils.mirroring(self.data['high_end_process'], y_spec_m, input_high_end, self.mp) 87 | wav_instrument = spec_utils.cmb_spectrogram_to_wave(y_spec_m, self.mp,input_high_end_h, input_high_end_) 88 | else: 89 | wav_instrument = spec_utils.cmb_spectrogram_to_wave(y_spec_m, self.mp) 90 | print ('%s instruments done'%name) 91 | wavfile.write(os.path.join(ins_root, 'instrument_{}.wav'.format(name) ), self.mp.param['sr'], (np.array(wav_instrument)*32768).astype("int16")) # 92 | if (vocal_root is not None): 93 | if self.data['high_end_process'].startswith('mirroring'): 94 | input_high_end_ = spec_utils.mirroring(self.data['high_end_process'], v_spec_m, input_high_end, self.mp) 95 | wav_vocals = spec_utils.cmb_spectrogram_to_wave(v_spec_m, self.mp, input_high_end_h, input_high_end_) 96 | else: 97 | wav_vocals = spec_utils.cmb_spectrogram_to_wave(v_spec_m, self.mp) 98 | print ('%s vocals done'%name) 99 | wavfile.write(os.path.join(vocal_root , 'vocal_{}.wav'.format(name) ), self.mp.param['sr'], (np.array(wav_vocals)*32768).astype("int16")) 100 | 101 | if __name__ == '__main__': 102 | device = 'cuda' 103 | is_half=True 104 | model_path='uvr5_weights/2_HP-UVR.pth' 105 | pre_fun = _audio_pre_(model_path=model_path,device=device,is_half=True) 106 | audio_path = '神女劈观.aac' 107 | save_path = 'opt' 108 | pre_fun._path_audio_(audio_path , save_path,save_path) 109 | -------------------------------------------------------------------------------- /logs/mute/0_gt_wavs/mute32k.wav: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/yantaisa11/Retrieval-based-Voice-Conversion-WebUI-JP-localization/106f3ac45e1a1206489b32500f1fd66ad1744339/logs/mute/0_gt_wavs/mute32k.wav -------------------------------------------------------------------------------- /logs/mute/0_gt_wavs/mute40k.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yantaisa11/Retrieval-based-Voice-Conversion-WebUI-JP-localization/106f3ac45e1a1206489b32500f1fd66ad1744339/logs/mute/0_gt_wavs/mute40k.wav -------------------------------------------------------------------------------- /logs/mute/0_gt_wavs/mute48k.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yantaisa11/Retrieval-based-Voice-Conversion-WebUI-JP-localization/106f3ac45e1a1206489b32500f1fd66ad1744339/logs/mute/0_gt_wavs/mute48k.wav -------------------------------------------------------------------------------- /logs/mute/1_16k_wavs/mute.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yantaisa11/Retrieval-based-Voice-Conversion-WebUI-JP-localization/106f3ac45e1a1206489b32500f1fd66ad1744339/logs/mute/1_16k_wavs/mute.wav -------------------------------------------------------------------------------- /logs/mute/2a_f0/mute.wav.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yantaisa11/Retrieval-based-Voice-Conversion-WebUI-JP-localization/106f3ac45e1a1206489b32500f1fd66ad1744339/logs/mute/2a_f0/mute.wav.npy -------------------------------------------------------------------------------- /logs/mute/2b-f0nsf/mute.wav.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yantaisa11/Retrieval-based-Voice-Conversion-WebUI-JP-localization/106f3ac45e1a1206489b32500f1fd66ad1744339/logs/mute/2b-f0nsf/mute.wav.npy -------------------------------------------------------------------------------- /logs/mute/3_feature256/mute.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yantaisa11/Retrieval-based-Voice-Conversion-WebUI-JP-localization/106f3ac45e1a1206489b32500f1fd66ad1744339/logs/mute/3_feature256/mute.npy -------------------------------------------------------------------------------- /my_utils.py: -------------------------------------------------------------------------------- 1 | import ffmpeg 2 | import numpy as np 3 | def load_audio(file,sr): 4 | try: 5 | # https://github.com/openai/whisper/blob/main/whisper/audio.py#L26 6 | # This launches a subprocess to decode audio while down-mixing and resampling as necessary. 7 | # Requires the ffmpeg CLI and `ffmpeg-python` package to be installed. 
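# Editor's note: the ffmpeg-python chain below is roughly equivalent to running
#   ffmpeg -nostdin -threads 0 -i <file> -f s16le -acodec pcm_s16le -ac 1 -ar <sr> -
# and reading the raw 16-bit PCM from stdout (<file> and <sr> are placeholders).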
8 | file=file.strip(" ").strip('"').strip("\n").strip('"').strip(" ")#防止小白拷路径头尾带了空格和"和回车 9 | out, _ = ( 10 | ffmpeg.input(file, threads=0) 11 | .output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sr) 12 | .run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True) 13 | ) 14 | except Exception as e: 15 | raise RuntimeError(f"Failed to load audio: {e}") 16 | 17 | return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0 18 | -------------------------------------------------------------------------------- /pretrained/.gitignore: -------------------------------------------------------------------------------- 1 | * 2 | !.gitignore 3 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.poetry] 2 | name = "rvc-beta" 3 | version = "0.1.0" 4 | description = "" 5 | authors = ["lj1995"] 6 | license = "MIT" 7 | 8 | [tool.poetry.dependencies] 9 | python = "^3.8" 10 | torch = "^2.0.0" 11 | torchaudio = "^2.0.1" 12 | Cython = "^0.29.34" 13 | gradio = "^3.24.1" 14 | future = "^0.18.3" 15 | pydub = "^0.25.1" 16 | soundfile = "^0.12.1" 17 | ffmpeg-python = "^0.2.0" 18 | tensorboardX = "^2.6" 19 | functorch = "^2.0.0" 20 | fairseq = "^0.12.2" 21 | faiss-gpu = "^1.7.2" 22 | faiss-cpu = "^1.7.3" 23 | Jinja2 = "^3.1.2" 24 | json5 = "^0.9.11" 25 | librosa = "0.9.2" 26 | llvmlite = "0.39.0" 27 | Markdown = "^3.4.3" 28 | matplotlib = "^3.7.1" 29 | matplotlib-inline = "^0.1.6" 30 | numba = "0.56.4" 31 | numpy = "1.23.5" 32 | scipy = "1.9.3" 33 | praat-parselmouth = "^0.4.3" 34 | Pillow = "9.1.1" 35 | pyworld = "^0.3.2" 36 | resampy = "^0.4.2" 37 | scikit-learn = "^1.2.2" 38 | starlette = "^0.26.1" 39 | tensorboard = "^2.12.1" 40 | tensorboard-data-server = "^0.7.0" 41 | tensorboard-plugin-wit = "^1.8.1" 42 | torchgen = "^0.0.1" 43 | tqdm = "^4.65.0" 44 | tornado = "^6.2" 45 | Werkzeug = "^2.2.3" 46 | uc-micro-py = "^1.0.1" 47 | sympy = "^1.11.1" 48 | tabulate = "^0.9.0" 49 | PyYAML = "^6.0" 50 | pyasn1 = "^0.4.8" 51 | pyasn1-modules = "^0.2.8" 52 | fsspec = "^2023.3.0" 53 | absl-py = "^1.4.0" 54 | audioread = "^3.0.0" 55 | uvicorn = "^0.21.1" 56 | colorama = "^0.4.6" 57 | 58 | [tool.poetry.dev-dependencies] 59 | 60 | [build-system] 61 | requires = ["poetry-core>=1.0.0"] 62 | build-backend = "poetry.core.masonry.api" 63 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numba==0.56.4 2 | numpy==1.23.5 3 | scipy==1.9.3 4 | librosa==0.9.2 5 | llvmlite==0.39.0 6 | fairseq==0.12.2 7 | faiss-cpu==1.7.2 8 | gradio 9 | Cython 10 | future>=0.18.3 11 | pydub>=0.25.1 12 | soundfile>=0.12.1 13 | ffmpeg-python>=0.2.0 14 | tensorboardX 15 | functorch>=2.0.0 16 | Jinja2>=3.1.2 17 | json5>=0.9.11 18 | Markdown 19 | matplotlib>=3.7.1 20 | matplotlib-inline>=0.1.6 21 | praat-parselmouth>=0.4.3 22 | Pillow>=9.1.1 23 | pyworld>=0.3.2 24 | resampy>=0.4.2 25 | scikit-learn>=1.2.2 26 | starlette>=0.26.1 27 | tensorboard 28 | tensorboard-data-server 29 | tensorboard-plugin-wit 30 | torchgen>=0.0.1 31 | tqdm>=4.65.0 32 | tornado>=6.2 33 | Werkzeug>=2.2.3 34 | uc-micro-py>=1.0.1 35 | sympy>=1.11.1 36 | tabulate>=0.9.0 37 | PyYAML>=6.0 38 | pyasn1>=0.4.8 39 | pyasn1-modules>=0.2.8 40 | fsspec>=2023.3.0 41 | absl-py>=1.4.0 42 | audioread 43 | uvicorn>=0.21.1 44 | colorama>=0.4.6 45 | 
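# Editor's note: a minimal usage sketch for my_utils.load_audio above; it is not part of the
# repo. It assumes the dependencies listed above are installed (pip install -r requirements.txt)
# and that the ffmpeg binary is on PATH; the file path is a placeholder.
from my_utils import load_audio

audio = load_audio("path/to/some_clip.wav", 16000)  # any ffmpeg-readable format works
print(audio.shape, audio.dtype)                     # (n_samples,), float32 roughly in [-1, 1]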
-------------------------------------------------------------------------------- /slicer2.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | 4 | # This function is obtained from librosa. 5 | def get_rms( 6 | y, 7 | frame_length=2048, 8 | hop_length=512, 9 | pad_mode="constant", 10 | ): 11 | padding = (int(frame_length // 2), int(frame_length // 2)) 12 | y = np.pad(y, padding, mode=pad_mode) 13 | 14 | axis = -1 15 | # put our new within-frame axis at the end for now 16 | out_strides = y.strides + tuple([y.strides[axis]]) 17 | # Reduce the shape on the framing axis 18 | x_shape_trimmed = list(y.shape) 19 | x_shape_trimmed[axis] -= frame_length - 1 20 | out_shape = tuple(x_shape_trimmed) + tuple([frame_length]) 21 | xw = np.lib.stride_tricks.as_strided( 22 | y, shape=out_shape, strides=out_strides 23 | ) 24 | if axis < 0: 25 | target_axis = axis - 1 26 | else: 27 | target_axis = axis + 1 28 | xw = np.moveaxis(xw, -1, target_axis) 29 | # Downsample along the target axis 30 | slices = [slice(None)] * xw.ndim 31 | slices[axis] = slice(0, None, hop_length) 32 | x = xw[tuple(slices)] 33 | 34 | # Calculate power 35 | power = np.mean(np.abs(x) ** 2, axis=-2, keepdims=True) 36 | 37 | return np.sqrt(power) 38 | 39 | 40 | class Slicer: 41 | def __init__(self, 42 | sr: int, 43 | threshold: float = -40., 44 | min_length: int = 5000, 45 | min_interval: int = 300, 46 | hop_size: int = 20, 47 | max_sil_kept: int = 5000): 48 | if not min_length >= min_interval >= hop_size: 49 | raise ValueError('The following condition must be satisfied: min_length >= min_interval >= hop_size') 50 | if not max_sil_kept >= hop_size: 51 | raise ValueError('The following condition must be satisfied: max_sil_kept >= hop_size') 52 | min_interval = sr * min_interval / 1000 53 | self.threshold = 10 ** (threshold / 20.) 54 | self.hop_size = round(sr * hop_size / 1000) 55 | self.win_size = min(round(min_interval), 4 * self.hop_size) 56 | self.min_length = round(sr * min_length / 1000 / self.hop_size) 57 | self.min_interval = round(min_interval / self.hop_size) 58 | self.max_sil_kept = round(sr * max_sil_kept / 1000 / self.hop_size) 59 | 60 | def _apply_slice(self, waveform, begin, end): 61 | if len(waveform.shape) > 1: 62 | return waveform[:, begin * self.hop_size: min(waveform.shape[1], end * self.hop_size)] 63 | else: 64 | return waveform[begin * self.hop_size: min(waveform.shape[0], end * self.hop_size)] 65 | 66 | # @timeit 67 | def slice(self, waveform): 68 | if len(waveform.shape) > 1: 69 | samples = waveform.mean(axis=0) 70 | else: 71 | samples = waveform 72 | if samples.shape[0] <= self.min_length: 73 | return [waveform] 74 | rms_list = get_rms(y=samples, frame_length=self.win_size, hop_length=self.hop_size).squeeze(0) 75 | sil_tags = [] 76 | silence_start = None 77 | clip_start = 0 78 | for i, rms in enumerate(rms_list): 79 | # Keep looping while frame is silent. 80 | if rms < self.threshold: 81 | # Record start of silent frames. 82 | if silence_start is None: 83 | silence_start = i 84 | continue 85 | # Keep looping while frame is not silent and silence start has not been recorded. 
86 | if silence_start is None: 87 | continue 88 | # Clear recorded silence start if interval is not enough or clip is too short 89 | is_leading_silence = silence_start == 0 and i > self.max_sil_kept 90 | need_slice_middle = i - silence_start >= self.min_interval and i - clip_start >= self.min_length 91 | if not is_leading_silence and not need_slice_middle: 92 | silence_start = None 93 | continue 94 | # Need slicing. Record the range of silent frames to be removed. 95 | if i - silence_start <= self.max_sil_kept: 96 | pos = rms_list[silence_start: i + 1].argmin() + silence_start 97 | if silence_start == 0: 98 | sil_tags.append((0, pos)) 99 | else: 100 | sil_tags.append((pos, pos)) 101 | clip_start = pos 102 | elif i - silence_start <= self.max_sil_kept * 2: 103 | pos = rms_list[i - self.max_sil_kept: silence_start + self.max_sil_kept + 1].argmin() 104 | pos += i - self.max_sil_kept 105 | pos_l = rms_list[silence_start: silence_start + self.max_sil_kept + 1].argmin() + silence_start 106 | pos_r = rms_list[i - self.max_sil_kept: i + 1].argmin() + i - self.max_sil_kept 107 | if silence_start == 0: 108 | sil_tags.append((0, pos_r)) 109 | clip_start = pos_r 110 | else: 111 | sil_tags.append((min(pos_l, pos), max(pos_r, pos))) 112 | clip_start = max(pos_r, pos) 113 | else: 114 | pos_l = rms_list[silence_start: silence_start + self.max_sil_kept + 1].argmin() + silence_start 115 | pos_r = rms_list[i - self.max_sil_kept: i + 1].argmin() + i - self.max_sil_kept 116 | if silence_start == 0: 117 | sil_tags.append((0, pos_r)) 118 | else: 119 | sil_tags.append((pos_l, pos_r)) 120 | clip_start = pos_r 121 | silence_start = None 122 | # Deal with trailing silence. 123 | total_frames = rms_list.shape[0] 124 | if silence_start is not None and total_frames - silence_start >= self.min_interval: 125 | silence_end = min(total_frames, silence_start + self.max_sil_kept) 126 | pos = rms_list[silence_start: silence_end + 1].argmin() + silence_start 127 | sil_tags.append((pos, total_frames + 1)) 128 | # Apply and return slices. 
129 | if len(sil_tags) == 0: 130 | return [waveform] 131 | else: 132 | chunks = [] 133 | if sil_tags[0][0] > 0: 134 | chunks.append(self._apply_slice(waveform, 0, sil_tags[0][0])) 135 | for i in range(len(sil_tags) - 1): 136 | chunks.append(self._apply_slice(waveform, sil_tags[i][1], sil_tags[i + 1][0])) 137 | if sil_tags[-1][1] < total_frames: 138 | chunks.append(self._apply_slice(waveform, sil_tags[-1][1], total_frames)) 139 | return chunks 140 | 141 | 142 | def main(): 143 | import os.path 144 | from argparse import ArgumentParser 145 | 146 | import librosa 147 | import soundfile 148 | 149 | parser = ArgumentParser() 150 | parser.add_argument('audio', type=str, help='The audio to be sliced') 151 | parser.add_argument('--out', type=str, help='Output directory of the sliced audio clips') 152 | parser.add_argument('--db_thresh', type=float, required=False, default=-40, 153 | help='The dB threshold for silence detection') 154 | parser.add_argument('--min_length', type=int, required=False, default=5000, 155 | help='The minimum milliseconds required for each sliced audio clip') 156 | parser.add_argument('--min_interval', type=int, required=False, default=300, 157 | help='The minimum milliseconds for a silence part to be sliced') 158 | parser.add_argument('--hop_size', type=int, required=False, default=10, 159 | help='Frame length in milliseconds') 160 | parser.add_argument('--max_sil_kept', type=int, required=False, default=500, 161 | help='The maximum silence length kept around the sliced clip, presented in milliseconds') 162 | args = parser.parse_args() 163 | out = args.out 164 | if out is None: 165 | out = os.path.dirname(os.path.abspath(args.audio)) 166 | audio, sr = librosa.load(args.audio, sr=None, mono=False) 167 | slicer = Slicer( 168 | sr=sr, 169 | threshold=args.db_thresh, 170 | min_length=args.min_length, 171 | min_interval=args.min_interval, 172 | hop_size=args.hop_size, 173 | max_sil_kept=args.max_sil_kept 174 | ) 175 | chunks = slicer.slice(audio) 176 | if not os.path.exists(out): 177 | os.makedirs(out) 178 | for i, chunk in enumerate(chunks): 179 | if len(chunk.shape) > 1: 180 | chunk = chunk.T 181 | soundfile.write(os.path.join(out, f'%s_%d.wav' % (os.path.basename(args.audio).rsplit('.', maxsplit=1)[0], i)), chunk, sr) 182 | 183 | 184 | if __name__ == '__main__': 185 | main() -------------------------------------------------------------------------------- /train/cmd.txt: -------------------------------------------------------------------------------- 1 | python train_nsf_sim_cache_sid.py -c configs/mi_mix40k_nsf_co256_cs1sid_ms2048.json -m ft-mi -------------------------------------------------------------------------------- /train/losses.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.nn import functional as F 3 | 4 | def feature_loss(fmap_r, fmap_g): 5 | loss = 0 6 | for dr, dg in zip(fmap_r, fmap_g): 7 | for rl, gl in zip(dr, dg): 8 | rl = rl.float().detach() 9 | gl = gl.float() 10 | loss += torch.mean(torch.abs(rl - gl)) 11 | 12 | return loss * 2 13 | 14 | 15 | def discriminator_loss(disc_real_outputs, disc_generated_outputs): 16 | loss = 0 17 | r_losses = [] 18 | g_losses = [] 19 | for dr, dg in zip(disc_real_outputs, disc_generated_outputs): 20 | dr = dr.float() 21 | dg = dg.float() 22 | r_loss = torch.mean((1 - dr) ** 2) 23 | g_loss = torch.mean(dg**2) 24 | loss += r_loss + g_loss 25 | r_losses.append(r_loss.item()) 26 | g_losses.append(g_loss.item()) 27 | 28 | return loss, r_losses, g_losses 29 | 30 
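# Editor's note: a tiny, hypothetical sanity check for the LSGAN-style discriminator loss
# above; it is not part of the repo. Real inputs are lists of per-sub-discriminator score
# tensors, so the values here are only for illustration. train/ is not a package, so the
# sketch assumes it is run from the repo root and puts the folder on sys.path first.
import sys, torch
sys.path.append("train")
from losses import discriminator_loss

real_scores = [torch.tensor([0.9, 1.1]), torch.tensor([1.0])]   # D(x): pushed toward 1
fake_scores = [torch.tensor([0.1, -0.2]), torch.tensor([0.0])]  # D(G(z)): pushed toward 0
loss, r_losses, g_losses = discriminator_loss(real_scores, fake_scores)
# loss sums mean((1 - D(x))^2) + mean(D(G(z))^2) over the sub-discriminators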
| 31 | def generator_loss(disc_outputs): 32 | loss = 0 33 | gen_losses = [] 34 | for dg in disc_outputs: 35 | dg = dg.float() 36 | l = torch.mean((1 - dg) ** 2) 37 | gen_losses.append(l) 38 | loss += l 39 | 40 | return loss, gen_losses 41 | 42 | 43 | def kl_loss(z_p, logs_q, m_p, logs_p, z_mask): 44 | """ 45 | z_p, logs_q: [b, h, t_t] 46 | m_p, logs_p: [b, h, t_t] 47 | """ 48 | z_p = z_p.float() 49 | logs_q = logs_q.float() 50 | m_p = m_p.float() 51 | logs_p = logs_p.float() 52 | z_mask = z_mask.float() 53 | 54 | kl = logs_p - logs_q - 0.5 55 | kl += 0.5 * ((z_p - m_p) ** 2) * torch.exp(-2.0 * logs_p) 56 | kl = torch.sum(kl * z_mask) 57 | l = kl / torch.sum(z_mask) 58 | return l 59 | -------------------------------------------------------------------------------- /train/mel_processing.py: -------------------------------------------------------------------------------- 1 | import math 2 | import os 3 | import random 4 | import torch 5 | from torch import nn 6 | import torch.nn.functional as F 7 | import torch.utils.data 8 | import numpy as np 9 | import librosa 10 | import librosa.util as librosa_util 11 | from librosa.util import normalize, pad_center, tiny 12 | from scipy.signal import get_window 13 | from scipy.io.wavfile import read 14 | from librosa.filters import mel as librosa_mel_fn 15 | 16 | MAX_WAV_VALUE = 32768.0 17 | 18 | 19 | def dynamic_range_compression_torch(x, C=1, clip_val=1e-5): 20 | """ 21 | PARAMS 22 | ------ 23 | C: compression factor 24 | """ 25 | return torch.log(torch.clamp(x, min=clip_val) * C) 26 | 27 | 28 | def dynamic_range_decompression_torch(x, C=1): 29 | """ 30 | PARAMS 31 | ------ 32 | C: compression factor used to compress 33 | """ 34 | return torch.exp(x) / C 35 | 36 | 37 | def spectral_normalize_torch(magnitudes): 38 | output = dynamic_range_compression_torch(magnitudes) 39 | return output 40 | 41 | 42 | def spectral_de_normalize_torch(magnitudes): 43 | output = dynamic_range_decompression_torch(magnitudes) 44 | return output 45 | 46 | 47 | mel_basis = {} 48 | hann_window = {} 49 | 50 | 51 | def spectrogram_torch(y, n_fft, sampling_rate, hop_size, win_size, center=False): 52 | if torch.min(y) < -1.0: 53 | print("min value is ", torch.min(y)) 54 | if torch.max(y) > 1.0: 55 | print("max value is ", torch.max(y)) 56 | 57 | global hann_window 58 | dtype_device = str(y.dtype) + "_" + str(y.device) 59 | wnsize_dtype_device = str(win_size) + "_" + dtype_device 60 | if wnsize_dtype_device not in hann_window: 61 | hann_window[wnsize_dtype_device] = torch.hann_window(win_size).to( 62 | dtype=y.dtype, device=y.device 63 | ) 64 | 65 | y = torch.nn.functional.pad( 66 | y.unsqueeze(1), 67 | (int((n_fft - hop_size) / 2), int((n_fft - hop_size) / 2)), 68 | mode="reflect", 69 | ) 70 | y = y.squeeze(1) 71 | 72 | spec = torch.stft( 73 | y, 74 | n_fft, 75 | hop_length=hop_size, 76 | win_length=win_size, 77 | window=hann_window[wnsize_dtype_device], 78 | center=center, 79 | pad_mode="reflect", 80 | normalized=False, 81 | onesided=True,return_complex=False 82 | ) 83 | 84 | spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6) 85 | return spec 86 | 87 | 88 | def spec_to_mel_torch(spec, n_fft, num_mels, sampling_rate, fmin, fmax): 89 | global mel_basis 90 | dtype_device = str(spec.dtype) + "_" + str(spec.device) 91 | fmax_dtype_device = str(fmax) + "_" + dtype_device 92 | if fmax_dtype_device not in mel_basis: 93 | mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax) 94 | mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to( 95 | dtype=spec.dtype, device=spec.device 96 | 
) 97 | spec = torch.matmul(mel_basis[fmax_dtype_device], spec) 98 | spec = spectral_normalize_torch(spec) 99 | return spec 100 | 101 | 102 | def mel_spectrogram_torch( 103 | y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False 104 | ): 105 | if torch.min(y) < -1.0: 106 | print("min value is ", torch.min(y)) 107 | if torch.max(y) > 1.0: 108 | print("max value is ", torch.max(y)) 109 | 110 | global mel_basis, hann_window 111 | dtype_device = str(y.dtype) + "_" + str(y.device) 112 | fmax_dtype_device = str(fmax) + "_" + dtype_device 113 | wnsize_dtype_device = str(win_size) + "_" + dtype_device 114 | if fmax_dtype_device not in mel_basis: 115 | mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax) 116 | mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to( 117 | dtype=y.dtype, device=y.device 118 | ) 119 | if wnsize_dtype_device not in hann_window: 120 | hann_window[wnsize_dtype_device] = torch.hann_window(win_size).to( 121 | dtype=y.dtype, device=y.device 122 | ) 123 | 124 | y = torch.nn.functional.pad( 125 | y.unsqueeze(1), 126 | (int((n_fft - hop_size) / 2), int((n_fft - hop_size) / 2)), 127 | mode="reflect", 128 | ) 129 | y = y.squeeze(1) 130 | 131 | # spec = torch.stft( 132 | # y, 133 | # n_fft, 134 | # hop_length=hop_size, 135 | # win_length=win_size, 136 | # window=hann_window[wnsize_dtype_device], 137 | # center=center, 138 | # pad_mode="reflect", 139 | # normalized=False, 140 | # onesided=True, 141 | # ) 142 | spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[wnsize_dtype_device], 143 | center=center, pad_mode='reflect', normalized=False, onesided=True, return_complex=False) 144 | spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6) 145 | 146 | spec = torch.matmul(mel_basis[fmax_dtype_device], spec) 147 | spec = spectral_normalize_torch(spec) 148 | 149 | return spec 150 | -------------------------------------------------------------------------------- /train/process_ckpt.py: -------------------------------------------------------------------------------- 1 | import torch,traceback,os,pdb 2 | from collections import OrderedDict 3 | 4 | def savee(ckpt,sr,if_f0,name,epoch): 5 | try: 6 | opt = OrderedDict() 7 | opt["weight"] = {} 8 | for key in ckpt.keys(): 9 | if ("enc_q" in key): continue 10 | opt["weight"][key] = ckpt[key].half() 11 | if(sr=="40k"):opt["config"] = [1025, 32, 192, 192, 768, 2, 6, 3, 0, "1", [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]], [10, 10, 2, 2], 512, [16, 16, 4, 4], 109, 256, 40000] 12 | elif(sr=="48k"):opt["config"] = [1025, 32, 192, 192, 768, 2, 6, 3, 0, "1", [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]], [10,6,2,2,2], 512, [16, 16, 4, 4,4], 109, 256, 48000] 13 | elif(sr=="32k"):opt["config"] = [513, 32, 192, 192, 768, 2, 6, 3, 0, "1", [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]], [10, 4, 2, 2, 2], 512, [16, 16, 4, 4,4], 109, 256, 32000] 14 | opt["info"] = "%sepoch"%epoch 15 | opt["sr"] = sr 16 | opt["f0"] =if_f0 17 | torch.save(opt, "weights/%s.pth"%name) 18 | return "Success." 
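# Editor's note: each opt["config"] list above packs, in order, the synthesizer constructor
# arguments used at inference time (spectrogram bins, training segment size, channel and
# attention sizes, resblock kernel/dilation settings, upsample rates and kernel sizes,
# speaker-embedding size, gin_channels, and the target sample rate). This reading is an
# informal inference from infer_pack/models.py, not an authoritative specification.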
19 | except: 20 | return traceback.format_exc() 21 | 22 | def show_info(path): 23 | try: 24 | a = torch.load(path, map_location="cpu") 25 | return "模型信息:%s\n采样率:%s\n模型是否输入音高引导:%s"%(a.get("info","None"),a.get("sr","None"),a.get("f0","None"),) 26 | except: 27 | return traceback.format_exc() 28 | 29 | def extract_small_model(path,name,sr,if_f0,info): 30 | try: 31 | ckpt = torch.load(path, map_location="cpu") 32 | if("model"in ckpt):ckpt=ckpt["model"] 33 | opt = OrderedDict() 34 | opt["weight"] = {} 35 | for key in ckpt.keys(): 36 | if ("enc_q" in key): continue 37 | opt["weight"][key] = ckpt[key].half() 38 | if(sr=="40k"):opt["config"] = [1025, 32, 192, 192, 768, 2, 6, 3, 0, "1", [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]], [10, 10, 2, 2], 512, [16, 16, 4, 4], 109, 256, 40000] 39 | elif(sr=="48k"):opt["config"] = [1025, 32, 192, 192, 768, 2, 6, 3, 0, "1", [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]], [10,6,2,2,2], 512, [16, 16, 4, 4,4], 109, 256, 48000] 40 | elif(sr=="32k"):opt["config"] = [513, 32, 192, 192, 768, 2, 6, 3, 0, "1", [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]], [10, 4, 2, 2, 2], 512, [16, 16, 4, 4,4], 109, 256, 32000] 41 | if(info==""):info="Extracted model." 42 | opt["info"] = info 43 | opt["sr"] = sr 44 | opt["f0"] =int(if_f0) 45 | torch.save(opt, "weights/%s.pth"%name) 46 | return "Success." 47 | except: 48 | return traceback.format_exc() 49 | 50 | def change_info(path,info,name): 51 | try: 52 | ckpt = torch.load(path, map_location="cpu") 53 | ckpt["info"]=info 54 | if(name==""):name=os.path.basename(path) 55 | torch.save(ckpt, "weights/%s"%name) 56 | return "Success." 57 | except: 58 | return traceback.format_exc() 59 | 60 | def merge(path1,path2,alpha1,sr,f0,info,name): 61 | try: 62 | def extract(ckpt): 63 | a = ckpt["model"] 64 | opt = OrderedDict() 65 | opt["weight"] = {} 66 | for key in a.keys(): 67 | if ("enc_q" in key): continue 68 | opt["weight"][key] = a[key] 69 | return opt 70 | ckpt1 = torch.load(path1, map_location="cpu") 71 | ckpt2 = torch.load(path2, map_location="cpu") 72 | cfg = ckpt1["config"] 73 | if("model"in ckpt1): ckpt1=extract(ckpt1) 74 | else: ckpt1=ckpt1["weight"] 75 | if("model"in ckpt2): ckpt2=extract(ckpt2) 76 | else: ckpt2=ckpt2["weight"] 77 | if(sorted(list(ckpt1.keys()))!=sorted(list(ckpt2.keys()))):return "Fail to merge the models. The model architectures are not the same." 
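# Editor's note: a hypothetical call to extract_small_model above (the path and name are
# placeholders, not files shipped with the repo). It drops the enc_q weights, halves the
# remaining tensors, and writes an inference-only checkpoint to weights/<name>.pth. The
# merge() routine that continues below blends two such checkpoints element-wise as
# alpha1 * w1 + (1 - alpha1) * w2.
import sys
sys.path.append("train")  # train/ is not a package; assumes the repo root as working dir
from process_ckpt import extract_small_model

msg = extract_small_model("logs/my-exp/G_2333.pth", "my-voice", "40k", 1, "Extracted model.")
print(msg)  # "Success." on success, otherwise the formatted traceback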
78 | opt = OrderedDict() 79 | opt["weight"] = {} 80 | for key in ckpt1.keys(): 81 | # try: 82 | if(key=="emb_g.weight"and ckpt1[key].shape!=ckpt2[key].shape): 83 | min_shape0=min(ckpt1[key].shape[0],ckpt2[key].shape[0]) 84 | opt["weight"][key] = (alpha1 * (ckpt1[key][:min_shape0].float()) + (1 - alpha1) * (ckpt2[key][:min_shape0].float())).half() 85 | else: 86 | opt["weight"][key] = (alpha1*(ckpt1[key].float())+(1-alpha1)*(ckpt2[key].float())).half() 87 | # except: 88 | # pdb.set_trace() 89 | opt["config"] = cfg 90 | ''' 91 | if(sr=="40k"):opt["config"] = [1025, 32, 192, 192, 768, 2, 6, 3, 0, "1", [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]], [10, 10, 2, 2], 512, [16, 16, 4, 4,4], 109, 256, 40000] 92 | elif(sr=="48k"):opt["config"] = [1025, 32, 192, 192, 768, 2, 6, 3, 0, "1", [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]], [10,6,2,2,2], 512, [16, 16, 4, 4], 109, 256, 48000] 93 | elif(sr=="32k"):opt["config"] = [513, 32, 192, 192, 768, 2, 6, 3, 0, "1", [3, 7, 11], [[1, 3, 5], [1, 3, 5], [1, 3, 5]], [10, 4, 2, 2, 2], 512, [16, 16, 4, 4,4], 109, 256, 32000] 94 | ''' 95 | opt["sr"]=sr 96 | opt["f0"]=1 if f0=="是"else 0 97 | opt["info"]=info 98 | torch.save(opt, "weights/%s.pth"%name) 99 | return "Success." 100 | except: 101 | return traceback.format_exc() 102 | -------------------------------------------------------------------------------- /train/utils.py: -------------------------------------------------------------------------------- 1 | import os,traceback 2 | import glob 3 | import sys 4 | import argparse 5 | import logging 6 | import json 7 | import subprocess 8 | import numpy as np 9 | from scipy.io.wavfile import read 10 | import torch 11 | 12 | MATPLOTLIB_FLAG = False 13 | 14 | logging.basicConfig(stream=sys.stdout, level=logging.DEBUG) 15 | logger = logging 16 | 17 | def load_checkpoint_d(checkpoint_path, combd,sbd, optimizer=None,load_opt=1): 18 | assert os.path.isfile(checkpoint_path) 19 | checkpoint_dict = torch.load(checkpoint_path, map_location='cpu') 20 | 21 | ################## 22 | def go(model,bkey): 23 | saved_state_dict = checkpoint_dict[bkey] 24 | if hasattr(model, 'module'):state_dict = model.module.state_dict() 25 | else:state_dict = model.state_dict() 26 | new_state_dict= {} 27 | for k, v in state_dict.items():#模型需要的shape 28 | try: 29 | new_state_dict[k] = saved_state_dict[k] 30 | if(saved_state_dict[k].shape!=state_dict[k].shape): 31 | print("shape-%s-mismatch|need-%s|get-%s"%(k,state_dict[k].shape,saved_state_dict[k].shape))# 32 | raise KeyError 33 | except: 34 | # logger.info(traceback.format_exc()) 35 | logger.info("%s is not in the checkpoint" % k)#pretrain缺失的 36 | new_state_dict[k] = v#模型自带的随机值 37 | if hasattr(model, 'module'): 38 | model.module.load_state_dict(new_state_dict,strict=False) 39 | else: 40 | model.load_state_dict(new_state_dict,strict=False) 41 | go(combd,"combd") 42 | go(sbd,"sbd") 43 | ############# 44 | logger.info("Loaded model weights") 45 | 46 | iteration = checkpoint_dict['iteration'] 47 | learning_rate = checkpoint_dict['learning_rate'] 48 | if optimizer is not None and load_opt==1:###加载不了,如果是空的的话,重新初始化,可能还会影响lr时间表的更新,因此在train文件最外围catch 49 | # try: 50 | optimizer.load_state_dict(checkpoint_dict['optimizer']) 51 | # except: 52 | # traceback.print_exc() 53 | logger.info("Loaded checkpoint '{}' (epoch {})" .format(checkpoint_path, iteration)) 54 | return model, optimizer, learning_rate, iteration 55 | 56 | 57 | # def load_checkpoint(checkpoint_path, model, optimizer=None): 58 | # assert os.path.isfile(checkpoint_path) 59 | # checkpoint_dict = 
torch.load(checkpoint_path, map_location='cpu') 60 | # iteration = checkpoint_dict['iteration'] 61 | # learning_rate = checkpoint_dict['learning_rate'] 62 | # if optimizer is not None: 63 | # optimizer.load_state_dict(checkpoint_dict['optimizer']) 64 | # # print(1111) 65 | # saved_state_dict = checkpoint_dict['model'] 66 | # # print(1111) 67 | # 68 | # if hasattr(model, 'module'): 69 | # state_dict = model.module.state_dict() 70 | # else: 71 | # state_dict = model.state_dict() 72 | # new_state_dict= {} 73 | # for k, v in state_dict.items(): 74 | # try: 75 | # new_state_dict[k] = saved_state_dict[k] 76 | # except: 77 | # logger.info("%s is not in the checkpoint" % k) 78 | # new_state_dict[k] = v 79 | # if hasattr(model, 'module'): 80 | # model.module.load_state_dict(new_state_dict) 81 | # else: 82 | # model.load_state_dict(new_state_dict) 83 | # logger.info("Loaded checkpoint '{}' (epoch {})" .format( 84 | # checkpoint_path, iteration)) 85 | # return model, optimizer, learning_rate, iteration 86 | def load_checkpoint(checkpoint_path, model, optimizer=None,load_opt=1): 87 | assert os.path.isfile(checkpoint_path) 88 | checkpoint_dict = torch.load(checkpoint_path, map_location='cpu') 89 | 90 | saved_state_dict = checkpoint_dict['model'] 91 | if hasattr(model, 'module'): 92 | state_dict = model.module.state_dict() 93 | else: 94 | state_dict = model.state_dict() 95 | new_state_dict= {} 96 | for k, v in state_dict.items():#模型需要的shape 97 | try: 98 | new_state_dict[k] = saved_state_dict[k] 99 | if(saved_state_dict[k].shape!=state_dict[k].shape): 100 | print("shape-%s-mismatch|need-%s|get-%s"%(k,state_dict[k].shape,saved_state_dict[k].shape))# 101 | raise KeyError 102 | except: 103 | # logger.info(traceback.format_exc()) 104 | logger.info("%s is not in the checkpoint" % k)#pretrain缺失的 105 | new_state_dict[k] = v#模型自带的随机值 106 | if hasattr(model, 'module'): 107 | model.module.load_state_dict(new_state_dict,strict=False) 108 | else: 109 | model.load_state_dict(new_state_dict,strict=False) 110 | logger.info("Loaded model weights") 111 | 112 | iteration = checkpoint_dict['iteration'] 113 | learning_rate = checkpoint_dict['learning_rate'] 114 | if optimizer is not None and load_opt==1:###加载不了,如果是空的的话,重新初始化,可能还会影响lr时间表的更新,因此在train文件最外围catch 115 | # try: 116 | optimizer.load_state_dict(checkpoint_dict['optimizer']) 117 | # except: 118 | # traceback.print_exc() 119 | logger.info("Loaded checkpoint '{}' (epoch {})" .format(checkpoint_path, iteration)) 120 | return model, optimizer, learning_rate, iteration 121 | 122 | 123 | def save_checkpoint(model, optimizer, learning_rate, iteration, checkpoint_path): 124 | logger.info("Saving model and optimizer state at epoch {} to {}".format( 125 | iteration, checkpoint_path)) 126 | if hasattr(model, 'module'): 127 | state_dict = model.module.state_dict() 128 | else: 129 | state_dict = model.state_dict() 130 | torch.save({'model': state_dict, 131 | 'iteration': iteration, 132 | 'optimizer': optimizer.state_dict(), 133 | 'learning_rate': learning_rate}, checkpoint_path) 134 | def save_checkpoint_d(combd, sbd, optimizer, learning_rate, iteration, checkpoint_path): 135 | logger.info("Saving model and optimizer state at epoch {} to {}".format( 136 | iteration, checkpoint_path)) 137 | if hasattr(combd, 'module'): state_dict_combd = combd.module.state_dict() 138 | else:state_dict_combd = combd.state_dict() 139 | if hasattr(sbd, 'module'): state_dict_sbd = sbd.module.state_dict() 140 | else:state_dict_sbd = sbd.state_dict() 141 | torch.save({ 142 | 'combd': state_dict_combd, 
143 | 'sbd': state_dict_sbd, 144 | 'iteration': iteration, 145 | 'optimizer': optimizer.state_dict(), 146 | 'learning_rate': learning_rate}, checkpoint_path) 147 | 148 | 149 | def summarize(writer, global_step, scalars={}, histograms={}, images={}, audios={}, audio_sampling_rate=22050): 150 | for k, v in scalars.items(): 151 | writer.add_scalar(k, v, global_step) 152 | for k, v in histograms.items(): 153 | writer.add_histogram(k, v, global_step) 154 | for k, v in images.items(): 155 | writer.add_image(k, v, global_step, dataformats='HWC') 156 | for k, v in audios.items(): 157 | writer.add_audio(k, v, global_step, audio_sampling_rate) 158 | 159 | 160 | def latest_checkpoint_path(dir_path, regex="G_*.pth"): 161 | f_list = glob.glob(os.path.join(dir_path, regex)) 162 | f_list.sort(key=lambda f: int("".join(filter(str.isdigit, f)))) 163 | x = f_list[-1] 164 | print(x) 165 | return x 166 | 167 | 168 | def plot_spectrogram_to_numpy(spectrogram): 169 | global MATPLOTLIB_FLAG 170 | if not MATPLOTLIB_FLAG: 171 | import matplotlib 172 | matplotlib.use("Agg") 173 | MATPLOTLIB_FLAG = True 174 | mpl_logger = logging.getLogger('matplotlib') 175 | mpl_logger.setLevel(logging.WARNING) 176 | import matplotlib.pylab as plt 177 | import numpy as np 178 | 179 | fig, ax = plt.subplots(figsize=(10,2)) 180 | im = ax.imshow(spectrogram, aspect="auto", origin="lower", 181 | interpolation='none') 182 | plt.colorbar(im, ax=ax) 183 | plt.xlabel("Frames") 184 | plt.ylabel("Channels") 185 | plt.tight_layout() 186 | 187 | fig.canvas.draw() 188 | data = np.fromstring(fig.canvas.tostring_rgb(), dtype=np.uint8, sep='') 189 | data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,)) 190 | plt.close() 191 | return data 192 | 193 | 194 | def plot_alignment_to_numpy(alignment, info=None): 195 | global MATPLOTLIB_FLAG 196 | if not MATPLOTLIB_FLAG: 197 | import matplotlib 198 | matplotlib.use("Agg") 199 | MATPLOTLIB_FLAG = True 200 | mpl_logger = logging.getLogger('matplotlib') 201 | mpl_logger.setLevel(logging.WARNING) 202 | import matplotlib.pylab as plt 203 | import numpy as np 204 | 205 | fig, ax = plt.subplots(figsize=(6, 4)) 206 | im = ax.imshow(alignment.transpose(), aspect='auto', origin='lower', 207 | interpolation='none') 208 | fig.colorbar(im, ax=ax) 209 | xlabel = 'Decoder timestep' 210 | if info is not None: 211 | xlabel += '\n\n' + info 212 | plt.xlabel(xlabel) 213 | plt.ylabel('Encoder timestep') 214 | plt.tight_layout() 215 | 216 | fig.canvas.draw() 217 | data = np.fromstring(fig.canvas.tostring_rgb(), dtype=np.uint8, sep='') 218 | data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,)) 219 | plt.close() 220 | return data 221 | 222 | 223 | def load_wav_to_torch(full_path): 224 | sampling_rate, data = read(full_path) 225 | return torch.FloatTensor(data.astype(np.float32)), sampling_rate 226 | 227 | 228 | def load_filepaths_and_text(filename, split="|"): 229 | with open(filename, encoding='utf-8') as f: 230 | filepaths_and_text = [line.strip().split(split) for line in f] 231 | return filepaths_and_text 232 | 233 | 234 | def get_hparams(init=True): 235 | ''' 236 | todo: 237 | 结尾七人组: 238 | 保存频率、总epoch done 239 | bs done 240 | pretrainG、pretrainD done 241 | 卡号:os.en["CUDA_VISIBLE_DEVICES"] done 242 | if_latest todo 243 | 模型:if_f0 todo 244 | 采样率:自动选择config done 245 | 是否缓存数据集进GPU:if_cache_data_in_gpu done 246 | 247 | -m: 248 | 自动决定training_files路径,改掉train_nsf_load_pretrain.py里的hps.data.training_files done 249 | -c不要了 250 | ''' 251 | parser = argparse.ArgumentParser() 252 | # parser.add_argument('-c', 
'--config', type=str, default="configs/40k.json",help='JSON file for configuration') 253 | parser.add_argument('-se', '--save_every_epoch', type=int, required=True,help='checkpoint save frequency (epoch)') 254 | parser.add_argument('-te', '--total_epoch', type=int, required=True,help='total_epoch') 255 | parser.add_argument('-pg', '--pretrainG', type=str, default="",help='Pretrained Discriminator path') 256 | parser.add_argument('-pd', '--pretrainD', type=str, default="",help='Pretrained Generator path') 257 | parser.add_argument('-g', '--gpus', type=str, default="0",help='split by -') 258 | parser.add_argument('-bs', '--batch_size', type=int, required=True,help='batch size') 259 | parser.add_argument('-e', '--experiment_dir', type=str, required=True,help='experiment dir')#-m 260 | parser.add_argument('-sr', '--sample_rate', type=str, required=True,help='sample rate, 32k/40k/48k') 261 | parser.add_argument('-f0', '--if_f0', type=int, required=True,help='use f0 as one of the inputs of the model, 1 or 0') 262 | parser.add_argument('-l', '--if_latest', type=int, required=True,help='if only save the latest G/D pth file, 1 or 0') 263 | parser.add_argument('-c', '--if_cache_data_in_gpu', type=int, required=True,help='if caching the dataset in GPU memory, 1 or 0') 264 | 265 | args = parser.parse_args() 266 | name = args.experiment_dir 267 | experiment_dir = os.path.join("./logs", args.experiment_dir) 268 | 269 | if not os.path.exists(experiment_dir): 270 | os.makedirs(experiment_dir) 271 | 272 | config_path = "configs/%s.json"%args.sample_rate 273 | config_save_path = os.path.join(experiment_dir, "config.json") 274 | if init: 275 | with open(config_path, "r") as f: 276 | data = f.read() 277 | with open(config_save_path, "w") as f: 278 | f.write(data) 279 | else: 280 | with open(config_save_path, "r") as f: 281 | data = f.read() 282 | config = json.loads(data) 283 | 284 | hparams = HParams(**config) 285 | hparams.model_dir = hparams.experiment_dir = experiment_dir 286 | hparams.save_every_epoch = args.save_every_epoch 287 | hparams.name = name 288 | hparams.total_epoch = args.total_epoch 289 | hparams.pretrainG = args.pretrainG 290 | hparams.pretrainD = args.pretrainD 291 | hparams.gpus = args.gpus 292 | hparams.train.batch_size = args.batch_size 293 | hparams.sample_rate = args.sample_rate 294 | hparams.if_f0 = args.if_f0 295 | hparams.if_latest = args.if_latest 296 | hparams.if_cache_data_in_gpu = args.if_cache_data_in_gpu 297 | hparams.data.training_files = "%s/filelist.txt"%experiment_dir 298 | return hparams 299 | 300 | 301 | def get_hparams_from_dir(model_dir): 302 | config_save_path = os.path.join(model_dir, "config.json") 303 | with open(config_save_path, "r") as f: 304 | data = f.read() 305 | config = json.loads(data) 306 | 307 | hparams =HParams(**config) 308 | hparams.model_dir = model_dir 309 | return hparams 310 | 311 | 312 | def get_hparams_from_file(config_path): 313 | with open(config_path, "r") as f: 314 | data = f.read() 315 | config = json.loads(data) 316 | 317 | hparams =HParams(**config) 318 | return hparams 319 | 320 | 321 | def check_git_hash(model_dir): 322 | source_dir = os.path.dirname(os.path.realpath(__file__)) 323 | if not os.path.exists(os.path.join(source_dir, ".git")): 324 | logger.warn("{} is not a git repository, therefore hash value comparison will be ignored.".format( 325 | source_dir 326 | )) 327 | return 328 | 329 | cur_hash = subprocess.getoutput("git rev-parse HEAD") 330 | 331 | path = os.path.join(model_dir, "githash") 332 | if os.path.exists(path): 333 | 
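# Editor's note: an illustrative training invocation matching the get_hparams() flags above
# (experiment name and checkpoint paths are placeholders). Despite the help texts, -pg takes
# the pretrained generator weights and -pd the pretrained discriminator weights.
#   python train_nsf_sim_cache_sid_load_pretrain.py -e my-exp -sr 40k -f0 1 -bs 4 -g 0 \
#       -te 20 -se 5 -pg pretrained/f0G40k.pth -pd pretrained/f0D40k.pth -l 0 -c 0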
saved_hash = open(path).read() 334 | if saved_hash != cur_hash: 335 | logger.warn("git hash values are different. {}(saved) != {}(current)".format( 336 | saved_hash[:8], cur_hash[:8])) 337 | else: 338 | open(path, "w").write(cur_hash) 339 | 340 | 341 | def get_logger(model_dir, filename="train.log"): 342 | global logger 343 | logger = logging.getLogger(os.path.basename(model_dir)) 344 | logger.setLevel(logging.DEBUG) 345 | 346 | formatter = logging.Formatter("%(asctime)s\t%(name)s\t%(levelname)s\t%(message)s") 347 | if not os.path.exists(model_dir): 348 | os.makedirs(model_dir) 349 | h = logging.FileHandler(os.path.join(model_dir, filename)) 350 | h.setLevel(logging.DEBUG) 351 | h.setFormatter(formatter) 352 | logger.addHandler(h) 353 | return logger 354 | 355 | 356 | class HParams(): 357 | def __init__(self, **kwargs): 358 | for k, v in kwargs.items(): 359 | if type(v) == dict: 360 | v = HParams(**v) 361 | self[k] = v 362 | 363 | def keys(self): 364 | return self.__dict__.keys() 365 | 366 | def items(self): 367 | return self.__dict__.items() 368 | 369 | def values(self): 370 | return self.__dict__.values() 371 | 372 | def __len__(self): 373 | return len(self.__dict__) 374 | 375 | def __getitem__(self, key): 376 | return getattr(self, key) 377 | 378 | def __setitem__(self, key, value): 379 | return setattr(self, key, value) 380 | 381 | def __contains__(self, key): 382 | return key in self.__dict__ 383 | 384 | def __repr__(self): 385 | return self.__dict__.__repr__() 386 | -------------------------------------------------------------------------------- /trainset_preprocess_pipeline_print.py: -------------------------------------------------------------------------------- 1 | import sys,os,multiprocessing 2 | now_dir=os.getcwd() 3 | sys.path.append(now_dir) 4 | 5 | inp_root = sys.argv[1] 6 | sr = int(sys.argv[2]) 7 | n_p = int(sys.argv[3]) 8 | exp_dir = sys.argv[4] 9 | noparallel = sys.argv[5] == "True" 10 | import numpy as np,os,traceback 11 | from slicer2 import Slicer 12 | import librosa,traceback 13 | from scipy.io import wavfile 14 | import multiprocessing 15 | from my_utils import load_audio 16 | 17 | mutex = multiprocessing.Lock() 18 | 19 | class PreProcess(): 20 | def __init__(self,sr,exp_dir): 21 | self.slicer = Slicer( 22 | sr=sr, 23 | threshold=-32, 24 | min_length=800, 25 | min_interval=400, 26 | hop_size=15, 27 | max_sil_kept=150 28 | ) 29 | self.sr=sr 30 | self.per=3.7 31 | self.overlap=0.3 32 | self.tail=self.per+self.overlap 33 | self.max=0.95 34 | self.alpha=0.8 35 | self.exp_dir=exp_dir 36 | self.gt_wavs_dir="%s/0_gt_wavs"%exp_dir 37 | self.wavs16k_dir="%s/1_16k_wavs"%exp_dir 38 | self.f = open("%s/preprocess.log"%exp_dir, "a+") 39 | os.makedirs(self.exp_dir,exist_ok=True) 40 | os.makedirs(self.gt_wavs_dir,exist_ok=True) 41 | os.makedirs(self.wavs16k_dir,exist_ok=True) 42 | 43 | def print(self, strr): 44 | mutex.acquire() 45 | print(strr) 46 | self.f.write("%s\n" % strr) 47 | self.f.flush() 48 | mutex.release() 49 | 50 | def norm_write(self,tmp_audio,idx0,idx1): 51 | tmp_audio = (tmp_audio / np.abs(tmp_audio).max() * (self.max * self.alpha)) + (1 - self.alpha) * tmp_audio 52 | wavfile.write("%s/%s_%s.wav" % (self.gt_wavs_dir, idx0, idx1), self.sr, (tmp_audio*32768).astype(np.int16)) 53 | tmp_audio = librosa.resample(tmp_audio, orig_sr=self.sr, target_sr=16000) 54 | wavfile.write("%s/%s_%s.wav" % (self.wavs16k_dir, idx0, idx1), 16000, (tmp_audio*32768).astype(np.int16)) 55 | 56 | def pipeline(self,path, idx0): 57 | try: 58 | audio = load_audio(path,self.sr) 59 | idx1=0 60 | 
for audio in self.slicer.slice(audio): 61 | i = 0 62 | while (1): 63 | start = int(self.sr * (self.per - self.overlap) * i) 64 | i += 1 65 | if (len(audio[start:]) > self.tail * self.sr): 66 | tmp_audio = audio[start:start + int(self.per * self.sr)] 67 | self.norm_write(tmp_audio,idx0,idx1) 68 | idx1 += 1 69 | else: 70 | tmp_audio = audio[start:] 71 | break 72 | self.norm_write(tmp_audio, idx0, idx1) 73 | self.print("%s->Suc."%path) 74 | except: 75 | self.print("%s->%s"%(path,traceback.format_exc())) 76 | 77 | def pipeline_mp(self,infos): 78 | for path, idx0 in infos: 79 | self.pipeline(path,idx0) 80 | 81 | def pipeline_mp_inp_dir(self,inp_root,n_p): 82 | try: 83 | infos = [("%s/%s" % (inp_root, name), idx) for idx, name in enumerate(sorted(list(os.listdir(inp_root))))] 84 | if noparallel: 85 | for i in range(n_p): self.pipeline_mp(infos[i::n_p]) 86 | else: 87 | ps=[] 88 | for i in range(n_p): 89 | p=multiprocessing.Process(target=self.pipeline_mp,args=(infos[i::n_p],)) 90 | p.start() 91 | ps.append(p) 92 | for p in ps:p.join() 93 | except: 94 | self.print("Fail. %s"%traceback.format_exc()) 95 | 96 | def preprocess_trainset(inp_root, sr, n_p, exp_dir): 97 | pp=PreProcess(sr,exp_dir) 98 | pp.print("start preprocess") 99 | pp.print(sys.argv) 100 | pp.pipeline_mp_inp_dir(inp_root,n_p) 101 | pp.print("end preprocess") 102 | 103 | if __name__=='__main__': 104 | preprocess_trainset(inp_root, sr, n_p, exp_dir) 105 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/dataset.py: -------------------------------------------------------------------------------- 1 | import os 2 | import random 3 | 4 | import numpy as np 5 | import torch 6 | import torch.utils.data 7 | from tqdm import tqdm 8 | 9 | from uvr5_pack.lib_v5 import spec_utils 10 | 11 | 12 | class VocalRemoverValidationSet(torch.utils.data.Dataset): 13 | 14 | def __init__(self, patch_list): 15 | self.patch_list = patch_list 16 | 17 | def __len__(self): 18 | return len(self.patch_list) 19 | 20 | def __getitem__(self, idx): 21 | path = self.patch_list[idx] 22 | data = np.load(path) 23 | 24 | X, y = data['X'], data['y'] 25 | 26 | X_mag = np.abs(X) 27 | y_mag = np.abs(y) 28 | 29 | return X_mag, y_mag 30 | 31 | 32 | def make_pair(mix_dir, inst_dir): 33 | input_exts = ['.wav', '.m4a', '.mp3', '.mp4', '.flac'] 34 | 35 | X_list = sorted([ 36 | os.path.join(mix_dir, fname) 37 | for fname in os.listdir(mix_dir) 38 | if os.path.splitext(fname)[1] in input_exts]) 39 | y_list = sorted([ 40 | os.path.join(inst_dir, fname) 41 | for fname in os.listdir(inst_dir) 42 | if os.path.splitext(fname)[1] in input_exts]) 43 | 44 | filelist = list(zip(X_list, y_list)) 45 | 46 | return filelist 47 | 48 | 49 | def train_val_split(dataset_dir, split_mode, val_rate, val_filelist): 50 | if split_mode == 'random': 51 | filelist = make_pair( 52 | os.path.join(dataset_dir, 'mixtures'), 53 | os.path.join(dataset_dir, 'instruments')) 54 | 55 | random.shuffle(filelist) 56 | 57 | if len(val_filelist) == 0: 58 | val_size = int(len(filelist) * val_rate) 59 | train_filelist = filelist[:-val_size] 60 | val_filelist = filelist[-val_size:] 61 | else: 62 | train_filelist = [ 63 | pair for pair in filelist 64 | if list(pair) not in val_filelist] 65 | elif split_mode == 'subdirs': 66 | if len(val_filelist) != 0: 67 | raise ValueError('The `val_filelist` option is not available in `subdirs` mode') 68 | 69 | train_filelist = make_pair( 70 | os.path.join(dataset_dir, 'training/mixtures'), 71 | os.path.join(dataset_dir, 'training/instruments')) 
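# Editor's note (refers to trainset_preprocess_pipeline_print.py above): the script is driven
# by positional sys.argv values only: input dir, target sample rate, number of worker
# processes, experiment/log dir, and the noparallel flag, e.g. (paths are placeholders):
#   python trainset_preprocess_pipeline_print.py /path/to/raw_vocals 40000 8 logs/my-exp False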
72 | 73 | val_filelist = make_pair( 74 | os.path.join(dataset_dir, 'validation/mixtures'), 75 | os.path.join(dataset_dir, 'validation/instruments')) 76 | 77 | return train_filelist, val_filelist 78 | 79 | 80 | def augment(X, y, reduction_rate, reduction_mask, mixup_rate, mixup_alpha): 81 | perm = np.random.permutation(len(X)) 82 | for i, idx in enumerate(tqdm(perm)): 83 | if np.random.uniform() < reduction_rate: 84 | y[idx] = spec_utils.reduce_vocal_aggressively(X[idx], y[idx], reduction_mask) 85 | 86 | if np.random.uniform() < 0.5: 87 | # swap channel 88 | X[idx] = X[idx, ::-1] 89 | y[idx] = y[idx, ::-1] 90 | if np.random.uniform() < 0.02: 91 | # mono 92 | X[idx] = X[idx].mean(axis=0, keepdims=True) 93 | y[idx] = y[idx].mean(axis=0, keepdims=True) 94 | if np.random.uniform() < 0.02: 95 | # inst 96 | X[idx] = y[idx] 97 | 98 | if np.random.uniform() < mixup_rate and i < len(perm) - 1: 99 | lam = np.random.beta(mixup_alpha, mixup_alpha) 100 | X[idx] = lam * X[idx] + (1 - lam) * X[perm[i + 1]] 101 | y[idx] = lam * y[idx] + (1 - lam) * y[perm[i + 1]] 102 | 103 | return X, y 104 | 105 | 106 | def make_padding(width, cropsize, offset): 107 | left = offset 108 | roi_size = cropsize - left * 2 109 | if roi_size == 0: 110 | roi_size = cropsize 111 | right = roi_size - (width % roi_size) + left 112 | 113 | return left, right, roi_size 114 | 115 | 116 | def make_training_set(filelist, cropsize, patches, sr, hop_length, n_fft, offset): 117 | len_dataset = patches * len(filelist) 118 | 119 | X_dataset = np.zeros( 120 | (len_dataset, 2, n_fft // 2 + 1, cropsize), dtype=np.complex64) 121 | y_dataset = np.zeros( 122 | (len_dataset, 2, n_fft // 2 + 1, cropsize), dtype=np.complex64) 123 | 124 | for i, (X_path, y_path) in enumerate(tqdm(filelist)): 125 | X, y = spec_utils.cache_or_load(X_path, y_path, sr, hop_length, n_fft) 126 | coef = np.max([np.abs(X).max(), np.abs(y).max()]) 127 | X, y = X / coef, y / coef 128 | 129 | l, r, roi_size = make_padding(X.shape[2], cropsize, offset) 130 | X_pad = np.pad(X, ((0, 0), (0, 0), (l, r)), mode='constant') 131 | y_pad = np.pad(y, ((0, 0), (0, 0), (l, r)), mode='constant') 132 | 133 | starts = np.random.randint(0, X_pad.shape[2] - cropsize, patches) 134 | ends = starts + cropsize 135 | for j in range(patches): 136 | idx = i * patches + j 137 | X_dataset[idx] = X_pad[:, :, starts[j]:ends[j]] 138 | y_dataset[idx] = y_pad[:, :, starts[j]:ends[j]] 139 | 140 | return X_dataset, y_dataset 141 | 142 | 143 | def make_validation_set(filelist, cropsize, sr, hop_length, n_fft, offset): 144 | patch_list = [] 145 | patch_dir = 'cs{}_sr{}_hl{}_nf{}_of{}'.format(cropsize, sr, hop_length, n_fft, offset) 146 | os.makedirs(patch_dir, exist_ok=True) 147 | 148 | for i, (X_path, y_path) in enumerate(tqdm(filelist)): 149 | basename = os.path.splitext(os.path.basename(X_path))[0] 150 | 151 | X, y = spec_utils.cache_or_load(X_path, y_path, sr, hop_length, n_fft) 152 | coef = np.max([np.abs(X).max(), np.abs(y).max()]) 153 | X, y = X / coef, y / coef 154 | 155 | l, r, roi_size = make_padding(X.shape[2], cropsize, offset) 156 | X_pad = np.pad(X, ((0, 0), (0, 0), (l, r)), mode='constant') 157 | y_pad = np.pad(y, ((0, 0), (0, 0), (l, r)), mode='constant') 158 | 159 | len_dataset = int(np.ceil(X.shape[2] / roi_size)) 160 | for j in range(len_dataset): 161 | outpath = os.path.join(patch_dir, '{}_p{}.npz'.format(basename, j)) 162 | start = j * roi_size 163 | if not os.path.exists(outpath): 164 | np.savez( 165 | outpath, 166 | X=X_pad[:, :, start:start + cropsize], 167 | y=y_pad[:, :, start:start + 
cropsize]) 168 | patch_list.append(outpath) 169 | 170 | return VocalRemoverValidationSet(patch_list) 171 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/layers.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | import torch.nn.functional as F 4 | 5 | from uvr5_pack.lib_v5 import spec_utils 6 | 7 | 8 | class Conv2DBNActiv(nn.Module): 9 | 10 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU): 11 | super(Conv2DBNActiv, self).__init__() 12 | self.conv = nn.Sequential( 13 | nn.Conv2d( 14 | nin, nout, 15 | kernel_size=ksize, 16 | stride=stride, 17 | padding=pad, 18 | dilation=dilation, 19 | bias=False), 20 | nn.BatchNorm2d(nout), 21 | activ() 22 | ) 23 | 24 | def __call__(self, x): 25 | return self.conv(x) 26 | 27 | 28 | class SeperableConv2DBNActiv(nn.Module): 29 | 30 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU): 31 | super(SeperableConv2DBNActiv, self).__init__() 32 | self.conv = nn.Sequential( 33 | nn.Conv2d( 34 | nin, nin, 35 | kernel_size=ksize, 36 | stride=stride, 37 | padding=pad, 38 | dilation=dilation, 39 | groups=nin, 40 | bias=False), 41 | nn.Conv2d( 42 | nin, nout, 43 | kernel_size=1, 44 | bias=False), 45 | nn.BatchNorm2d(nout), 46 | activ() 47 | ) 48 | 49 | def __call__(self, x): 50 | return self.conv(x) 51 | 52 | 53 | class Encoder(nn.Module): 54 | 55 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.LeakyReLU): 56 | super(Encoder, self).__init__() 57 | self.conv1 = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ) 58 | self.conv2 = Conv2DBNActiv(nout, nout, ksize, stride, pad, activ=activ) 59 | 60 | def __call__(self, x): 61 | skip = self.conv1(x) 62 | h = self.conv2(skip) 63 | 64 | return h, skip 65 | 66 | 67 | class Decoder(nn.Module): 68 | 69 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.ReLU, dropout=False): 70 | super(Decoder, self).__init__() 71 | self.conv = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ) 72 | self.dropout = nn.Dropout2d(0.1) if dropout else None 73 | 74 | def __call__(self, x, skip=None): 75 | x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=True) 76 | if skip is not None: 77 | skip = spec_utils.crop_center(skip, x) 78 | x = torch.cat([x, skip], dim=1) 79 | h = self.conv(x) 80 | 81 | if self.dropout is not None: 82 | h = self.dropout(h) 83 | 84 | return h 85 | 86 | 87 | class ASPPModule(nn.Module): 88 | 89 | def __init__(self, nin, nout, dilations=(4, 8, 16), activ=nn.ReLU): 90 | super(ASPPModule, self).__init__() 91 | self.conv1 = nn.Sequential( 92 | nn.AdaptiveAvgPool2d((1, None)), 93 | Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ) 94 | ) 95 | self.conv2 = Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ) 96 | self.conv3 = SeperableConv2DBNActiv( 97 | nin, nin, 3, 1, dilations[0], dilations[0], activ=activ) 98 | self.conv4 = SeperableConv2DBNActiv( 99 | nin, nin, 3, 1, dilations[1], dilations[1], activ=activ) 100 | self.conv5 = SeperableConv2DBNActiv( 101 | nin, nin, 3, 1, dilations[2], dilations[2], activ=activ) 102 | self.bottleneck = nn.Sequential( 103 | Conv2DBNActiv(nin * 5, nout, 1, 1, 0, activ=activ), 104 | nn.Dropout2d(0.1) 105 | ) 106 | 107 | def forward(self, x): 108 | _, _, h, w = x.size() 109 | feat1 = F.interpolate(self.conv1(x), size=(h, w), mode='bilinear', align_corners=True) 110 | feat2 = self.conv2(x) 111 | feat3 = self.conv3(x) 112 | feat4 = self.conv4(x) 113 | feat5 = 
self.conv5(x) 114 | out = torch.cat((feat1, feat2, feat3, feat4, feat5), dim=1) 115 | bottle = self.bottleneck(out) 116 | return bottle 117 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/layers_123812KB .py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | import torch.nn.functional as F 4 | 5 | from uvr5_pack.lib_v5 import spec_utils 6 | 7 | 8 | class Conv2DBNActiv(nn.Module): 9 | 10 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU): 11 | super(Conv2DBNActiv, self).__init__() 12 | self.conv = nn.Sequential( 13 | nn.Conv2d( 14 | nin, nout, 15 | kernel_size=ksize, 16 | stride=stride, 17 | padding=pad, 18 | dilation=dilation, 19 | bias=False), 20 | nn.BatchNorm2d(nout), 21 | activ() 22 | ) 23 | 24 | def __call__(self, x): 25 | return self.conv(x) 26 | 27 | 28 | class SeperableConv2DBNActiv(nn.Module): 29 | 30 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU): 31 | super(SeperableConv2DBNActiv, self).__init__() 32 | self.conv = nn.Sequential( 33 | nn.Conv2d( 34 | nin, nin, 35 | kernel_size=ksize, 36 | stride=stride, 37 | padding=pad, 38 | dilation=dilation, 39 | groups=nin, 40 | bias=False), 41 | nn.Conv2d( 42 | nin, nout, 43 | kernel_size=1, 44 | bias=False), 45 | nn.BatchNorm2d(nout), 46 | activ() 47 | ) 48 | 49 | def __call__(self, x): 50 | return self.conv(x) 51 | 52 | 53 | class Encoder(nn.Module): 54 | 55 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.LeakyReLU): 56 | super(Encoder, self).__init__() 57 | self.conv1 = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ) 58 | self.conv2 = Conv2DBNActiv(nout, nout, ksize, stride, pad, activ=activ) 59 | 60 | def __call__(self, x): 61 | skip = self.conv1(x) 62 | h = self.conv2(skip) 63 | 64 | return h, skip 65 | 66 | 67 | class Decoder(nn.Module): 68 | 69 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.ReLU, dropout=False): 70 | super(Decoder, self).__init__() 71 | self.conv = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ) 72 | self.dropout = nn.Dropout2d(0.1) if dropout else None 73 | 74 | def __call__(self, x, skip=None): 75 | x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=True) 76 | if skip is not None: 77 | skip = spec_utils.crop_center(skip, x) 78 | x = torch.cat([x, skip], dim=1) 79 | h = self.conv(x) 80 | 81 | if self.dropout is not None: 82 | h = self.dropout(h) 83 | 84 | return h 85 | 86 | 87 | class ASPPModule(nn.Module): 88 | 89 | def __init__(self, nin, nout, dilations=(4, 8, 16), activ=nn.ReLU): 90 | super(ASPPModule, self).__init__() 91 | self.conv1 = nn.Sequential( 92 | nn.AdaptiveAvgPool2d((1, None)), 93 | Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ) 94 | ) 95 | self.conv2 = Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ) 96 | self.conv3 = SeperableConv2DBNActiv( 97 | nin, nin, 3, 1, dilations[0], dilations[0], activ=activ) 98 | self.conv4 = SeperableConv2DBNActiv( 99 | nin, nin, 3, 1, dilations[1], dilations[1], activ=activ) 100 | self.conv5 = SeperableConv2DBNActiv( 101 | nin, nin, 3, 1, dilations[2], dilations[2], activ=activ) 102 | self.bottleneck = nn.Sequential( 103 | Conv2DBNActiv(nin * 5, nout, 1, 1, 0, activ=activ), 104 | nn.Dropout2d(0.1) 105 | ) 106 | 107 | def forward(self, x): 108 | _, _, h, w = x.size() 109 | feat1 = F.interpolate(self.conv1(x), size=(h, w), mode='bilinear', align_corners=True) 110 | feat2 = self.conv2(x) 111 | feat3 = self.conv3(x) 
112 | feat4 = self.conv4(x) 113 | feat5 = self.conv5(x) 114 | out = torch.cat((feat1, feat2, feat3, feat4, feat5), dim=1) 115 | bottle = self.bottleneck(out) 116 | return bottle 117 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/layers_123821KB.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | import torch.nn.functional as F 4 | 5 | from uvr5_pack.lib_v5 import spec_utils 6 | 7 | 8 | class Conv2DBNActiv(nn.Module): 9 | 10 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU): 11 | super(Conv2DBNActiv, self).__init__() 12 | self.conv = nn.Sequential( 13 | nn.Conv2d( 14 | nin, nout, 15 | kernel_size=ksize, 16 | stride=stride, 17 | padding=pad, 18 | dilation=dilation, 19 | bias=False), 20 | nn.BatchNorm2d(nout), 21 | activ() 22 | ) 23 | 24 | def __call__(self, x): 25 | return self.conv(x) 26 | 27 | 28 | class SeperableConv2DBNActiv(nn.Module): 29 | 30 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU): 31 | super(SeperableConv2DBNActiv, self).__init__() 32 | self.conv = nn.Sequential( 33 | nn.Conv2d( 34 | nin, nin, 35 | kernel_size=ksize, 36 | stride=stride, 37 | padding=pad, 38 | dilation=dilation, 39 | groups=nin, 40 | bias=False), 41 | nn.Conv2d( 42 | nin, nout, 43 | kernel_size=1, 44 | bias=False), 45 | nn.BatchNorm2d(nout), 46 | activ() 47 | ) 48 | 49 | def __call__(self, x): 50 | return self.conv(x) 51 | 52 | 53 | class Encoder(nn.Module): 54 | 55 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.LeakyReLU): 56 | super(Encoder, self).__init__() 57 | self.conv1 = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ) 58 | self.conv2 = Conv2DBNActiv(nout, nout, ksize, stride, pad, activ=activ) 59 | 60 | def __call__(self, x): 61 | skip = self.conv1(x) 62 | h = self.conv2(skip) 63 | 64 | return h, skip 65 | 66 | 67 | class Decoder(nn.Module): 68 | 69 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.ReLU, dropout=False): 70 | super(Decoder, self).__init__() 71 | self.conv = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ) 72 | self.dropout = nn.Dropout2d(0.1) if dropout else None 73 | 74 | def __call__(self, x, skip=None): 75 | x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=True) 76 | if skip is not None: 77 | skip = spec_utils.crop_center(skip, x) 78 | x = torch.cat([x, skip], dim=1) 79 | h = self.conv(x) 80 | 81 | if self.dropout is not None: 82 | h = self.dropout(h) 83 | 84 | return h 85 | 86 | 87 | class ASPPModule(nn.Module): 88 | 89 | def __init__(self, nin, nout, dilations=(4, 8, 16), activ=nn.ReLU): 90 | super(ASPPModule, self).__init__() 91 | self.conv1 = nn.Sequential( 92 | nn.AdaptiveAvgPool2d((1, None)), 93 | Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ) 94 | ) 95 | self.conv2 = Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ) 96 | self.conv3 = SeperableConv2DBNActiv( 97 | nin, nin, 3, 1, dilations[0], dilations[0], activ=activ) 98 | self.conv4 = SeperableConv2DBNActiv( 99 | nin, nin, 3, 1, dilations[1], dilations[1], activ=activ) 100 | self.conv5 = SeperableConv2DBNActiv( 101 | nin, nin, 3, 1, dilations[2], dilations[2], activ=activ) 102 | self.bottleneck = nn.Sequential( 103 | Conv2DBNActiv(nin * 5, nout, 1, 1, 0, activ=activ), 104 | nn.Dropout2d(0.1) 105 | ) 106 | 107 | def forward(self, x): 108 | _, _, h, w = x.size() 109 | feat1 = F.interpolate(self.conv1(x), size=(h, w), mode='bilinear', align_corners=True) 110 | feat2 = 
self.conv2(x) 111 | feat3 = self.conv3(x) 112 | feat4 = self.conv4(x) 113 | feat5 = self.conv5(x) 114 | out = torch.cat((feat1, feat2, feat3, feat4, feat5), dim=1) 115 | bottle = self.bottleneck(out) 116 | return bottle 117 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/layers_33966KB.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | import torch.nn.functional as F 4 | 5 | from uvr5_pack.lib_v5 import spec_utils 6 | 7 | 8 | class Conv2DBNActiv(nn.Module): 9 | 10 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU): 11 | super(Conv2DBNActiv, self).__init__() 12 | self.conv = nn.Sequential( 13 | nn.Conv2d( 14 | nin, nout, 15 | kernel_size=ksize, 16 | stride=stride, 17 | padding=pad, 18 | dilation=dilation, 19 | bias=False), 20 | nn.BatchNorm2d(nout), 21 | activ() 22 | ) 23 | 24 | def __call__(self, x): 25 | return self.conv(x) 26 | 27 | 28 | class SeperableConv2DBNActiv(nn.Module): 29 | 30 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU): 31 | super(SeperableConv2DBNActiv, self).__init__() 32 | self.conv = nn.Sequential( 33 | nn.Conv2d( 34 | nin, nin, 35 | kernel_size=ksize, 36 | stride=stride, 37 | padding=pad, 38 | dilation=dilation, 39 | groups=nin, 40 | bias=False), 41 | nn.Conv2d( 42 | nin, nout, 43 | kernel_size=1, 44 | bias=False), 45 | nn.BatchNorm2d(nout), 46 | activ() 47 | ) 48 | 49 | def __call__(self, x): 50 | return self.conv(x) 51 | 52 | 53 | class Encoder(nn.Module): 54 | 55 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.LeakyReLU): 56 | super(Encoder, self).__init__() 57 | self.conv1 = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ) 58 | self.conv2 = Conv2DBNActiv(nout, nout, ksize, stride, pad, activ=activ) 59 | 60 | def __call__(self, x): 61 | skip = self.conv1(x) 62 | h = self.conv2(skip) 63 | 64 | return h, skip 65 | 66 | 67 | class Decoder(nn.Module): 68 | 69 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.ReLU, dropout=False): 70 | super(Decoder, self).__init__() 71 | self.conv = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ) 72 | self.dropout = nn.Dropout2d(0.1) if dropout else None 73 | 74 | def __call__(self, x, skip=None): 75 | x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=True) 76 | if skip is not None: 77 | skip = spec_utils.crop_center(skip, x) 78 | x = torch.cat([x, skip], dim=1) 79 | h = self.conv(x) 80 | 81 | if self.dropout is not None: 82 | h = self.dropout(h) 83 | 84 | return h 85 | 86 | 87 | class ASPPModule(nn.Module): 88 | 89 | def __init__(self, nin, nout, dilations=(4, 8, 16, 32, 64), activ=nn.ReLU): 90 | super(ASPPModule, self).__init__() 91 | self.conv1 = nn.Sequential( 92 | nn.AdaptiveAvgPool2d((1, None)), 93 | Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ) 94 | ) 95 | self.conv2 = Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ) 96 | self.conv3 = SeperableConv2DBNActiv( 97 | nin, nin, 3, 1, dilations[0], dilations[0], activ=activ) 98 | self.conv4 = SeperableConv2DBNActiv( 99 | nin, nin, 3, 1, dilations[1], dilations[1], activ=activ) 100 | self.conv5 = SeperableConv2DBNActiv( 101 | nin, nin, 3, 1, dilations[2], dilations[2], activ=activ) 102 | self.conv6 = SeperableConv2DBNActiv( 103 | nin, nin, 3, 1, dilations[2], dilations[2], activ=activ) 104 | self.conv7 = SeperableConv2DBNActiv( 105 | nin, nin, 3, 1, dilations[2], dilations[2], activ=activ) 106 | self.bottleneck = nn.Sequential( 
107 | Conv2DBNActiv(nin * 7, nout, 1, 1, 0, activ=activ), 108 | nn.Dropout2d(0.1) 109 | ) 110 | 111 | def forward(self, x): 112 | _, _, h, w = x.size() 113 | feat1 = F.interpolate(self.conv1(x), size=(h, w), mode='bilinear', align_corners=True) 114 | feat2 = self.conv2(x) 115 | feat3 = self.conv3(x) 116 | feat4 = self.conv4(x) 117 | feat5 = self.conv5(x) 118 | feat6 = self.conv6(x) 119 | feat7 = self.conv7(x) 120 | out = torch.cat((feat1, feat2, feat3, feat4, feat5, feat6, feat7), dim=1) 121 | bottle = self.bottleneck(out) 122 | return bottle 123 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/layers_537227KB.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | import torch.nn.functional as F 4 | 5 | from uvr5_pack.lib_v5 import spec_utils 6 | 7 | 8 | class Conv2DBNActiv(nn.Module): 9 | 10 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU): 11 | super(Conv2DBNActiv, self).__init__() 12 | self.conv = nn.Sequential( 13 | nn.Conv2d( 14 | nin, nout, 15 | kernel_size=ksize, 16 | stride=stride, 17 | padding=pad, 18 | dilation=dilation, 19 | bias=False), 20 | nn.BatchNorm2d(nout), 21 | activ() 22 | ) 23 | 24 | def __call__(self, x): 25 | return self.conv(x) 26 | 27 | 28 | class SeperableConv2DBNActiv(nn.Module): 29 | 30 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU): 31 | super(SeperableConv2DBNActiv, self).__init__() 32 | self.conv = nn.Sequential( 33 | nn.Conv2d( 34 | nin, nin, 35 | kernel_size=ksize, 36 | stride=stride, 37 | padding=pad, 38 | dilation=dilation, 39 | groups=nin, 40 | bias=False), 41 | nn.Conv2d( 42 | nin, nout, 43 | kernel_size=1, 44 | bias=False), 45 | nn.BatchNorm2d(nout), 46 | activ() 47 | ) 48 | 49 | def __call__(self, x): 50 | return self.conv(x) 51 | 52 | 53 | class Encoder(nn.Module): 54 | 55 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.LeakyReLU): 56 | super(Encoder, self).__init__() 57 | self.conv1 = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ) 58 | self.conv2 = Conv2DBNActiv(nout, nout, ksize, stride, pad, activ=activ) 59 | 60 | def __call__(self, x): 61 | skip = self.conv1(x) 62 | h = self.conv2(skip) 63 | 64 | return h, skip 65 | 66 | 67 | class Decoder(nn.Module): 68 | 69 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.ReLU, dropout=False): 70 | super(Decoder, self).__init__() 71 | self.conv = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ) 72 | self.dropout = nn.Dropout2d(0.1) if dropout else None 73 | 74 | def __call__(self, x, skip=None): 75 | x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=True) 76 | if skip is not None: 77 | skip = spec_utils.crop_center(skip, x) 78 | x = torch.cat([x, skip], dim=1) 79 | h = self.conv(x) 80 | 81 | if self.dropout is not None: 82 | h = self.dropout(h) 83 | 84 | return h 85 | 86 | 87 | class ASPPModule(nn.Module): 88 | 89 | def __init__(self, nin, nout, dilations=(4, 8, 16, 32, 64), activ=nn.ReLU): 90 | super(ASPPModule, self).__init__() 91 | self.conv1 = nn.Sequential( 92 | nn.AdaptiveAvgPool2d((1, None)), 93 | Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ) 94 | ) 95 | self.conv2 = Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ) 96 | self.conv3 = SeperableConv2DBNActiv( 97 | nin, nin, 3, 1, dilations[0], dilations[0], activ=activ) 98 | self.conv4 = SeperableConv2DBNActiv( 99 | nin, nin, 3, 1, dilations[1], dilations[1], activ=activ) 100 | self.conv5 = 
SeperableConv2DBNActiv( 101 | nin, nin, 3, 1, dilations[2], dilations[2], activ=activ) 102 | self.conv6 = SeperableConv2DBNActiv( 103 | nin, nin, 3, 1, dilations[2], dilations[2], activ=activ) 104 | self.conv7 = SeperableConv2DBNActiv( 105 | nin, nin, 3, 1, dilations[2], dilations[2], activ=activ) 106 | self.bottleneck = nn.Sequential( 107 | Conv2DBNActiv(nin * 7, nout, 1, 1, 0, activ=activ), 108 | nn.Dropout2d(0.1) 109 | ) 110 | 111 | def forward(self, x): 112 | _, _, h, w = x.size() 113 | feat1 = F.interpolate(self.conv1(x), size=(h, w), mode='bilinear', align_corners=True) 114 | feat2 = self.conv2(x) 115 | feat3 = self.conv3(x) 116 | feat4 = self.conv4(x) 117 | feat5 = self.conv5(x) 118 | feat6 = self.conv6(x) 119 | feat7 = self.conv7(x) 120 | out = torch.cat((feat1, feat2, feat3, feat4, feat5, feat6, feat7), dim=1) 121 | bottle = self.bottleneck(out) 122 | return bottle 123 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/layers_537238KB.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | import torch.nn.functional as F 4 | 5 | from uvr5_pack.lib_v5 import spec_utils 6 | 7 | 8 | class Conv2DBNActiv(nn.Module): 9 | 10 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU): 11 | super(Conv2DBNActiv, self).__init__() 12 | self.conv = nn.Sequential( 13 | nn.Conv2d( 14 | nin, nout, 15 | kernel_size=ksize, 16 | stride=stride, 17 | padding=pad, 18 | dilation=dilation, 19 | bias=False), 20 | nn.BatchNorm2d(nout), 21 | activ() 22 | ) 23 | 24 | def __call__(self, x): 25 | return self.conv(x) 26 | 27 | 28 | class SeperableConv2DBNActiv(nn.Module): 29 | 30 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU): 31 | super(SeperableConv2DBNActiv, self).__init__() 32 | self.conv = nn.Sequential( 33 | nn.Conv2d( 34 | nin, nin, 35 | kernel_size=ksize, 36 | stride=stride, 37 | padding=pad, 38 | dilation=dilation, 39 | groups=nin, 40 | bias=False), 41 | nn.Conv2d( 42 | nin, nout, 43 | kernel_size=1, 44 | bias=False), 45 | nn.BatchNorm2d(nout), 46 | activ() 47 | ) 48 | 49 | def __call__(self, x): 50 | return self.conv(x) 51 | 52 | 53 | class Encoder(nn.Module): 54 | 55 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.LeakyReLU): 56 | super(Encoder, self).__init__() 57 | self.conv1 = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ) 58 | self.conv2 = Conv2DBNActiv(nout, nout, ksize, stride, pad, activ=activ) 59 | 60 | def __call__(self, x): 61 | skip = self.conv1(x) 62 | h = self.conv2(skip) 63 | 64 | return h, skip 65 | 66 | 67 | class Decoder(nn.Module): 68 | 69 | def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.ReLU, dropout=False): 70 | super(Decoder, self).__init__() 71 | self.conv = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ) 72 | self.dropout = nn.Dropout2d(0.1) if dropout else None 73 | 74 | def __call__(self, x, skip=None): 75 | x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=True) 76 | if skip is not None: 77 | skip = spec_utils.crop_center(skip, x) 78 | x = torch.cat([x, skip], dim=1) 79 | h = self.conv(x) 80 | 81 | if self.dropout is not None: 82 | h = self.dropout(h) 83 | 84 | return h 85 | 86 | 87 | class ASPPModule(nn.Module): 88 | 89 | def __init__(self, nin, nout, dilations=(4, 8, 16, 32, 64), activ=nn.ReLU): 90 | super(ASPPModule, self).__init__() 91 | self.conv1 = nn.Sequential( 92 | nn.AdaptiveAvgPool2d((1, None)), 93 | 
Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ) 94 | ) 95 | self.conv2 = Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ) 96 | self.conv3 = SeperableConv2DBNActiv( 97 | nin, nin, 3, 1, dilations[0], dilations[0], activ=activ) 98 | self.conv4 = SeperableConv2DBNActiv( 99 | nin, nin, 3, 1, dilations[1], dilations[1], activ=activ) 100 | self.conv5 = SeperableConv2DBNActiv( 101 | nin, nin, 3, 1, dilations[2], dilations[2], activ=activ) 102 | self.conv6 = SeperableConv2DBNActiv( 103 | nin, nin, 3, 1, dilations[2], dilations[2], activ=activ) 104 | self.conv7 = SeperableConv2DBNActiv( 105 | nin, nin, 3, 1, dilations[2], dilations[2], activ=activ) 106 | self.bottleneck = nn.Sequential( 107 | Conv2DBNActiv(nin * 7, nout, 1, 1, 0, activ=activ), 108 | nn.Dropout2d(0.1) 109 | ) 110 | 111 | def forward(self, x): 112 | _, _, h, w = x.size() 113 | feat1 = F.interpolate(self.conv1(x), size=(h, w), mode='bilinear', align_corners=True) 114 | feat2 = self.conv2(x) 115 | feat3 = self.conv3(x) 116 | feat4 = self.conv4(x) 117 | feat5 = self.conv5(x) 118 | feat6 = self.conv6(x) 119 | feat7 = self.conv7(x) 120 | out = torch.cat((feat1, feat2, feat3, feat4, feat5, feat6, feat7), dim=1) 121 | bottle = self.bottleneck(out) 122 | return bottle 123 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/model_param_init.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import pathlib 4 | 5 | default_param = {} 6 | default_param['bins'] = 768 7 | default_param['unstable_bins'] = 9 # training only 8 | default_param['reduction_bins'] = 762 # training only 9 | default_param['sr'] = 44100 10 | default_param['pre_filter_start'] = 757 11 | default_param['pre_filter_stop'] = 768 12 | default_param['band'] = {} 13 | 14 | 15 | default_param['band'][1] = { 16 | 'sr': 11025, 17 | 'hl': 128, 18 | 'n_fft': 960, 19 | 'crop_start': 0, 20 | 'crop_stop': 245, 21 | 'lpf_start': 61, # inference only 22 | 'res_type': 'polyphase' 23 | } 24 | 25 | default_param['band'][2] = { 26 | 'sr': 44100, 27 | 'hl': 512, 28 | 'n_fft': 1536, 29 | 'crop_start': 24, 30 | 'crop_stop': 547, 31 | 'hpf_start': 81, # inference only 32 | 'res_type': 'sinc_best' 33 | } 34 | 35 | 36 | def int_keys(d): 37 | r = {} 38 | for k, v in d: 39 | if k.isdigit(): 40 | k = int(k) 41 | r[k] = v 42 | return r 43 | 44 | 45 | class ModelParameters(object): 46 | def __init__(self, config_path=''): 47 | if '.pth' == pathlib.Path(config_path).suffix: 48 | import zipfile 49 | 50 | with zipfile.ZipFile(config_path, 'r') as zip: 51 | self.param = json.loads(zip.read('param.json'), object_pairs_hook=int_keys) 52 | elif '.json' == pathlib.Path(config_path).suffix: 53 | with open(config_path, 'r') as f: 54 | self.param = json.loads(f.read(), object_pairs_hook=int_keys) 55 | else: 56 | self.param = default_param 57 | 58 | for k in ['mid_side', 'mid_side_b', 'mid_side_b2', 'stereo_w', 'stereo_n', 'reverse']: 59 | if not k in self.param: 60 | self.param[k] = False -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/1band_sr16000_hl512.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 1024, 3 | "unstable_bins": 0, 4 | "reduction_bins": 0, 5 | "band": { 6 | "1": { 7 | "sr": 16000, 8 | "hl": 512, 9 | "n_fft": 2048, 10 | "crop_start": 0, 11 | "crop_stop": 1024, 12 | "hpf_start": -1, 13 | "res_type": "sinc_best" 14 | } 15 | }, 16 | "sr": 16000, 17 | "pre_filter_start": 1023, 
18 | "pre_filter_stop": 1024 19 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/1band_sr32000_hl512.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 1024, 3 | "unstable_bins": 0, 4 | "reduction_bins": 0, 5 | "band": { 6 | "1": { 7 | "sr": 32000, 8 | "hl": 512, 9 | "n_fft": 2048, 10 | "crop_start": 0, 11 | "crop_stop": 1024, 12 | "hpf_start": -1, 13 | "res_type": "kaiser_fast" 14 | } 15 | }, 16 | "sr": 32000, 17 | "pre_filter_start": 1000, 18 | "pre_filter_stop": 1021 19 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/1band_sr33075_hl384.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 1024, 3 | "unstable_bins": 0, 4 | "reduction_bins": 0, 5 | "band": { 6 | "1": { 7 | "sr": 33075, 8 | "hl": 384, 9 | "n_fft": 2048, 10 | "crop_start": 0, 11 | "crop_stop": 1024, 12 | "hpf_start": -1, 13 | "res_type": "sinc_best" 14 | } 15 | }, 16 | "sr": 33075, 17 | "pre_filter_start": 1000, 18 | "pre_filter_stop": 1021 19 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/1band_sr44100_hl1024.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 1024, 3 | "unstable_bins": 0, 4 | "reduction_bins": 0, 5 | "band": { 6 | "1": { 7 | "sr": 44100, 8 | "hl": 1024, 9 | "n_fft": 2048, 10 | "crop_start": 0, 11 | "crop_stop": 1024, 12 | "hpf_start": -1, 13 | "res_type": "sinc_best" 14 | } 15 | }, 16 | "sr": 44100, 17 | "pre_filter_start": 1023, 18 | "pre_filter_stop": 1024 19 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/1band_sr44100_hl256.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 256, 3 | "unstable_bins": 0, 4 | "reduction_bins": 0, 5 | "band": { 6 | "1": { 7 | "sr": 44100, 8 | "hl": 256, 9 | "n_fft": 512, 10 | "crop_start": 0, 11 | "crop_stop": 256, 12 | "hpf_start": -1, 13 | "res_type": "sinc_best" 14 | } 15 | }, 16 | "sr": 44100, 17 | "pre_filter_start": 256, 18 | "pre_filter_stop": 256 19 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/1band_sr44100_hl512.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 1024, 3 | "unstable_bins": 0, 4 | "reduction_bins": 0, 5 | "band": { 6 | "1": { 7 | "sr": 44100, 8 | "hl": 512, 9 | "n_fft": 2048, 10 | "crop_start": 0, 11 | "crop_stop": 1024, 12 | "hpf_start": -1, 13 | "res_type": "sinc_best" 14 | } 15 | }, 16 | "sr": 44100, 17 | "pre_filter_start": 1023, 18 | "pre_filter_stop": 1024 19 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/1band_sr44100_hl512_cut.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 1024, 3 | "unstable_bins": 0, 4 | "reduction_bins": 0, 5 | "band": { 6 | "1": { 7 | "sr": 44100, 8 | "hl": 512, 9 | "n_fft": 2048, 10 | "crop_start": 0, 11 | "crop_stop": 700, 12 | "hpf_start": -1, 13 | "res_type": "sinc_best" 14 | } 15 | }, 16 | "sr": 44100, 17 | "pre_filter_start": 1023, 18 | "pre_filter_stop": 700 19 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/2band_32000.json: 
-------------------------------------------------------------------------------- 1 | { 2 | "bins": 768, 3 | "unstable_bins": 7, 4 | "reduction_bins": 705, 5 | "band": { 6 | "1": { 7 | "sr": 6000, 8 | "hl": 66, 9 | "n_fft": 512, 10 | "crop_start": 0, 11 | "crop_stop": 240, 12 | "lpf_start": 60, 13 | "lpf_stop": 118, 14 | "res_type": "sinc_fastest" 15 | }, 16 | "2": { 17 | "sr": 32000, 18 | "hl": 352, 19 | "n_fft": 1024, 20 | "crop_start": 22, 21 | "crop_stop": 505, 22 | "hpf_start": 44, 23 | "hpf_stop": 23, 24 | "res_type": "sinc_medium" 25 | } 26 | }, 27 | "sr": 32000, 28 | "pre_filter_start": 710, 29 | "pre_filter_stop": 731 30 | } 31 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/2band_44100_lofi.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 512, 3 | "unstable_bins": 7, 4 | "reduction_bins": 510, 5 | "band": { 6 | "1": { 7 | "sr": 11025, 8 | "hl": 160, 9 | "n_fft": 768, 10 | "crop_start": 0, 11 | "crop_stop": 192, 12 | "lpf_start": 41, 13 | "lpf_stop": 139, 14 | "res_type": "sinc_fastest" 15 | }, 16 | "2": { 17 | "sr": 44100, 18 | "hl": 640, 19 | "n_fft": 1024, 20 | "crop_start": 10, 21 | "crop_stop": 320, 22 | "hpf_start": 47, 23 | "hpf_stop": 15, 24 | "res_type": "sinc_medium" 25 | } 26 | }, 27 | "sr": 44100, 28 | "pre_filter_start": 510, 29 | "pre_filter_stop": 512 30 | } 31 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/2band_48000.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 768, 3 | "unstable_bins": 7, 4 | "reduction_bins": 705, 5 | "band": { 6 | "1": { 7 | "sr": 6000, 8 | "hl": 66, 9 | "n_fft": 512, 10 | "crop_start": 0, 11 | "crop_stop": 240, 12 | "lpf_start": 60, 13 | "lpf_stop": 240, 14 | "res_type": "sinc_fastest" 15 | }, 16 | "2": { 17 | "sr": 48000, 18 | "hl": 528, 19 | "n_fft": 1536, 20 | "crop_start": 22, 21 | "crop_stop": 505, 22 | "hpf_start": 82, 23 | "hpf_stop": 22, 24 | "res_type": "sinc_medium" 25 | } 26 | }, 27 | "sr": 48000, 28 | "pre_filter_start": 710, 29 | "pre_filter_stop": 731 30 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/3band_44100.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 768, 3 | "unstable_bins": 5, 4 | "reduction_bins": 733, 5 | "band": { 6 | "1": { 7 | "sr": 11025, 8 | "hl": 128, 9 | "n_fft": 768, 10 | "crop_start": 0, 11 | "crop_stop": 278, 12 | "lpf_start": 28, 13 | "lpf_stop": 140, 14 | "res_type": "polyphase" 15 | }, 16 | "2": { 17 | "sr": 22050, 18 | "hl": 256, 19 | "n_fft": 768, 20 | "crop_start": 14, 21 | "crop_stop": 322, 22 | "hpf_start": 70, 23 | "hpf_stop": 14, 24 | "lpf_start": 283, 25 | "lpf_stop": 314, 26 | "res_type": "polyphase" 27 | }, 28 | "3": { 29 | "sr": 44100, 30 | "hl": 512, 31 | "n_fft": 768, 32 | "crop_start": 131, 33 | "crop_stop": 313, 34 | "hpf_start": 154, 35 | "hpf_stop": 141, 36 | "res_type": "sinc_medium" 37 | } 38 | }, 39 | "sr": 44100, 40 | "pre_filter_start": 757, 41 | "pre_filter_stop": 768 42 | } 43 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/3band_44100_mid.json: -------------------------------------------------------------------------------- 1 | { 2 | "mid_side": true, 3 | "bins": 768, 4 | "unstable_bins": 5, 5 | "reduction_bins": 733, 6 | "band": { 7 | "1": { 8 | "sr": 
11025, 9 | "hl": 128, 10 | "n_fft": 768, 11 | "crop_start": 0, 12 | "crop_stop": 278, 13 | "lpf_start": 28, 14 | "lpf_stop": 140, 15 | "res_type": "polyphase" 16 | }, 17 | "2": { 18 | "sr": 22050, 19 | "hl": 256, 20 | "n_fft": 768, 21 | "crop_start": 14, 22 | "crop_stop": 322, 23 | "hpf_start": 70, 24 | "hpf_stop": 14, 25 | "lpf_start": 283, 26 | "lpf_stop": 314, 27 | "res_type": "polyphase" 28 | }, 29 | "3": { 30 | "sr": 44100, 31 | "hl": 512, 32 | "n_fft": 768, 33 | "crop_start": 131, 34 | "crop_stop": 313, 35 | "hpf_start": 154, 36 | "hpf_stop": 141, 37 | "res_type": "sinc_medium" 38 | } 39 | }, 40 | "sr": 44100, 41 | "pre_filter_start": 757, 42 | "pre_filter_stop": 768 43 | } 44 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/3band_44100_msb2.json: -------------------------------------------------------------------------------- 1 | { 2 | "mid_side_b2": true, 3 | "bins": 640, 4 | "unstable_bins": 7, 5 | "reduction_bins": 565, 6 | "band": { 7 | "1": { 8 | "sr": 11025, 9 | "hl": 108, 10 | "n_fft": 1024, 11 | "crop_start": 0, 12 | "crop_stop": 187, 13 | "lpf_start": 92, 14 | "lpf_stop": 186, 15 | "res_type": "polyphase" 16 | }, 17 | "2": { 18 | "sr": 22050, 19 | "hl": 216, 20 | "n_fft": 768, 21 | "crop_start": 0, 22 | "crop_stop": 212, 23 | "hpf_start": 68, 24 | "hpf_stop": 34, 25 | "lpf_start": 174, 26 | "lpf_stop": 209, 27 | "res_type": "polyphase" 28 | }, 29 | "3": { 30 | "sr": 44100, 31 | "hl": 432, 32 | "n_fft": 640, 33 | "crop_start": 66, 34 | "crop_stop": 307, 35 | "hpf_start": 86, 36 | "hpf_stop": 72, 37 | "res_type": "kaiser_fast" 38 | } 39 | }, 40 | "sr": 44100, 41 | "pre_filter_start": 639, 42 | "pre_filter_stop": 640 43 | } 44 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/4band_44100.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 768, 3 | "unstable_bins": 7, 4 | "reduction_bins": 668, 5 | "band": { 6 | "1": { 7 | "sr": 11025, 8 | "hl": 128, 9 | "n_fft": 1024, 10 | "crop_start": 0, 11 | "crop_stop": 186, 12 | "lpf_start": 37, 13 | "lpf_stop": 73, 14 | "res_type": "polyphase" 15 | }, 16 | "2": { 17 | "sr": 11025, 18 | "hl": 128, 19 | "n_fft": 512, 20 | "crop_start": 4, 21 | "crop_stop": 185, 22 | "hpf_start": 36, 23 | "hpf_stop": 18, 24 | "lpf_start": 93, 25 | "lpf_stop": 185, 26 | "res_type": "polyphase" 27 | }, 28 | "3": { 29 | "sr": 22050, 30 | "hl": 256, 31 | "n_fft": 512, 32 | "crop_start": 46, 33 | "crop_stop": 186, 34 | "hpf_start": 93, 35 | "hpf_stop": 46, 36 | "lpf_start": 164, 37 | "lpf_stop": 186, 38 | "res_type": "polyphase" 39 | }, 40 | "4": { 41 | "sr": 44100, 42 | "hl": 512, 43 | "n_fft": 768, 44 | "crop_start": 121, 45 | "crop_stop": 382, 46 | "hpf_start": 138, 47 | "hpf_stop": 123, 48 | "res_type": "sinc_medium" 49 | } 50 | }, 51 | "sr": 44100, 52 | "pre_filter_start": 740, 53 | "pre_filter_stop": 768 54 | } 55 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/4band_44100_mid.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 768, 3 | "unstable_bins": 7, 4 | "mid_side": true, 5 | "reduction_bins": 668, 6 | "band": { 7 | "1": { 8 | "sr": 11025, 9 | "hl": 128, 10 | "n_fft": 1024, 11 | "crop_start": 0, 12 | "crop_stop": 186, 13 | "lpf_start": 37, 14 | "lpf_stop": 73, 15 | "res_type": "polyphase" 16 | }, 17 | "2": { 18 | "sr": 11025, 19 | "hl": 128, 20 | "n_fft": 
512, 21 | "crop_start": 4, 22 | "crop_stop": 185, 23 | "hpf_start": 36, 24 | "hpf_stop": 18, 25 | "lpf_start": 93, 26 | "lpf_stop": 185, 27 | "res_type": "polyphase" 28 | }, 29 | "3": { 30 | "sr": 22050, 31 | "hl": 256, 32 | "n_fft": 512, 33 | "crop_start": 46, 34 | "crop_stop": 186, 35 | "hpf_start": 93, 36 | "hpf_stop": 46, 37 | "lpf_start": 164, 38 | "lpf_stop": 186, 39 | "res_type": "polyphase" 40 | }, 41 | "4": { 42 | "sr": 44100, 43 | "hl": 512, 44 | "n_fft": 768, 45 | "crop_start": 121, 46 | "crop_stop": 382, 47 | "hpf_start": 138, 48 | "hpf_stop": 123, 49 | "res_type": "sinc_medium" 50 | } 51 | }, 52 | "sr": 44100, 53 | "pre_filter_start": 740, 54 | "pre_filter_stop": 768 55 | } 56 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/4band_44100_msb.json: -------------------------------------------------------------------------------- 1 | { 2 | "mid_side_b": true, 3 | "bins": 768, 4 | "unstable_bins": 7, 5 | "reduction_bins": 668, 6 | "band": { 7 | "1": { 8 | "sr": 11025, 9 | "hl": 128, 10 | "n_fft": 1024, 11 | "crop_start": 0, 12 | "crop_stop": 186, 13 | "lpf_start": 37, 14 | "lpf_stop": 73, 15 | "res_type": "polyphase" 16 | }, 17 | "2": { 18 | "sr": 11025, 19 | "hl": 128, 20 | "n_fft": 512, 21 | "crop_start": 4, 22 | "crop_stop": 185, 23 | "hpf_start": 36, 24 | "hpf_stop": 18, 25 | "lpf_start": 93, 26 | "lpf_stop": 185, 27 | "res_type": "polyphase" 28 | }, 29 | "3": { 30 | "sr": 22050, 31 | "hl": 256, 32 | "n_fft": 512, 33 | "crop_start": 46, 34 | "crop_stop": 186, 35 | "hpf_start": 93, 36 | "hpf_stop": 46, 37 | "lpf_start": 164, 38 | "lpf_stop": 186, 39 | "res_type": "polyphase" 40 | }, 41 | "4": { 42 | "sr": 44100, 43 | "hl": 512, 44 | "n_fft": 768, 45 | "crop_start": 121, 46 | "crop_stop": 382, 47 | "hpf_start": 138, 48 | "hpf_stop": 123, 49 | "res_type": "sinc_medium" 50 | } 51 | }, 52 | "sr": 44100, 53 | "pre_filter_start": 740, 54 | "pre_filter_stop": 768 55 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/4band_44100_msb2.json: -------------------------------------------------------------------------------- 1 | { 2 | "mid_side_b": true, 3 | "bins": 768, 4 | "unstable_bins": 7, 5 | "reduction_bins": 668, 6 | "band": { 7 | "1": { 8 | "sr": 11025, 9 | "hl": 128, 10 | "n_fft": 1024, 11 | "crop_start": 0, 12 | "crop_stop": 186, 13 | "lpf_start": 37, 14 | "lpf_stop": 73, 15 | "res_type": "polyphase" 16 | }, 17 | "2": { 18 | "sr": 11025, 19 | "hl": 128, 20 | "n_fft": 512, 21 | "crop_start": 4, 22 | "crop_stop": 185, 23 | "hpf_start": 36, 24 | "hpf_stop": 18, 25 | "lpf_start": 93, 26 | "lpf_stop": 185, 27 | "res_type": "polyphase" 28 | }, 29 | "3": { 30 | "sr": 22050, 31 | "hl": 256, 32 | "n_fft": 512, 33 | "crop_start": 46, 34 | "crop_stop": 186, 35 | "hpf_start": 93, 36 | "hpf_stop": 46, 37 | "lpf_start": 164, 38 | "lpf_stop": 186, 39 | "res_type": "polyphase" 40 | }, 41 | "4": { 42 | "sr": 44100, 43 | "hl": 512, 44 | "n_fft": 768, 45 | "crop_start": 121, 46 | "crop_stop": 382, 47 | "hpf_start": 138, 48 | "hpf_stop": 123, 49 | "res_type": "sinc_medium" 50 | } 51 | }, 52 | "sr": 44100, 53 | "pre_filter_start": 740, 54 | "pre_filter_stop": 768 55 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/4band_44100_reverse.json: -------------------------------------------------------------------------------- 1 | { 2 | "reverse": true, 3 | "bins": 768, 4 | "unstable_bins": 7, 5 | "reduction_bins": 668, 
6 | "band": { 7 | "1": { 8 | "sr": 11025, 9 | "hl": 128, 10 | "n_fft": 1024, 11 | "crop_start": 0, 12 | "crop_stop": 186, 13 | "lpf_start": 37, 14 | "lpf_stop": 73, 15 | "res_type": "polyphase" 16 | }, 17 | "2": { 18 | "sr": 11025, 19 | "hl": 128, 20 | "n_fft": 512, 21 | "crop_start": 4, 22 | "crop_stop": 185, 23 | "hpf_start": 36, 24 | "hpf_stop": 18, 25 | "lpf_start": 93, 26 | "lpf_stop": 185, 27 | "res_type": "polyphase" 28 | }, 29 | "3": { 30 | "sr": 22050, 31 | "hl": 256, 32 | "n_fft": 512, 33 | "crop_start": 46, 34 | "crop_stop": 186, 35 | "hpf_start": 93, 36 | "hpf_stop": 46, 37 | "lpf_start": 164, 38 | "lpf_stop": 186, 39 | "res_type": "polyphase" 40 | }, 41 | "4": { 42 | "sr": 44100, 43 | "hl": 512, 44 | "n_fft": 768, 45 | "crop_start": 121, 46 | "crop_stop": 382, 47 | "hpf_start": 138, 48 | "hpf_stop": 123, 49 | "res_type": "sinc_medium" 50 | } 51 | }, 52 | "sr": 44100, 53 | "pre_filter_start": 740, 54 | "pre_filter_stop": 768 55 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/4band_44100_sw.json: -------------------------------------------------------------------------------- 1 | { 2 | "stereo_w": true, 3 | "bins": 768, 4 | "unstable_bins": 7, 5 | "reduction_bins": 668, 6 | "band": { 7 | "1": { 8 | "sr": 11025, 9 | "hl": 128, 10 | "n_fft": 1024, 11 | "crop_start": 0, 12 | "crop_stop": 186, 13 | "lpf_start": 37, 14 | "lpf_stop": 73, 15 | "res_type": "polyphase" 16 | }, 17 | "2": { 18 | "sr": 11025, 19 | "hl": 128, 20 | "n_fft": 512, 21 | "crop_start": 4, 22 | "crop_stop": 185, 23 | "hpf_start": 36, 24 | "hpf_stop": 18, 25 | "lpf_start": 93, 26 | "lpf_stop": 185, 27 | "res_type": "polyphase" 28 | }, 29 | "3": { 30 | "sr": 22050, 31 | "hl": 256, 32 | "n_fft": 512, 33 | "crop_start": 46, 34 | "crop_stop": 186, 35 | "hpf_start": 93, 36 | "hpf_stop": 46, 37 | "lpf_start": 164, 38 | "lpf_stop": 186, 39 | "res_type": "polyphase" 40 | }, 41 | "4": { 42 | "sr": 44100, 43 | "hl": 512, 44 | "n_fft": 768, 45 | "crop_start": 121, 46 | "crop_stop": 382, 47 | "hpf_start": 138, 48 | "hpf_stop": 123, 49 | "res_type": "sinc_medium" 50 | } 51 | }, 52 | "sr": 44100, 53 | "pre_filter_start": 740, 54 | "pre_filter_stop": 768 55 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/4band_v2.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 672, 3 | "unstable_bins": 8, 4 | "reduction_bins": 637, 5 | "band": { 6 | "1": { 7 | "sr": 7350, 8 | "hl": 80, 9 | "n_fft": 640, 10 | "crop_start": 0, 11 | "crop_stop": 85, 12 | "lpf_start": 25, 13 | "lpf_stop": 53, 14 | "res_type": "polyphase" 15 | }, 16 | "2": { 17 | "sr": 7350, 18 | "hl": 80, 19 | "n_fft": 320, 20 | "crop_start": 4, 21 | "crop_stop": 87, 22 | "hpf_start": 25, 23 | "hpf_stop": 12, 24 | "lpf_start": 31, 25 | "lpf_stop": 62, 26 | "res_type": "polyphase" 27 | }, 28 | "3": { 29 | "sr": 14700, 30 | "hl": 160, 31 | "n_fft": 512, 32 | "crop_start": 17, 33 | "crop_stop": 216, 34 | "hpf_start": 48, 35 | "hpf_stop": 24, 36 | "lpf_start": 139, 37 | "lpf_stop": 210, 38 | "res_type": "polyphase" 39 | }, 40 | "4": { 41 | "sr": 44100, 42 | "hl": 480, 43 | "n_fft": 960, 44 | "crop_start": 78, 45 | "crop_stop": 383, 46 | "hpf_start": 130, 47 | "hpf_stop": 86, 48 | "res_type": "kaiser_fast" 49 | } 50 | }, 51 | "sr": 44100, 52 | "pre_filter_start": 668, 53 | "pre_filter_stop": 672 54 | } -------------------------------------------------------------------------------- 
/uvr5_pack/lib_v5/modelparams/4band_v2_sn.json: -------------------------------------------------------------------------------- 1 | { 2 | "bins": 672, 3 | "unstable_bins": 8, 4 | "reduction_bins": 637, 5 | "band": { 6 | "1": { 7 | "sr": 7350, 8 | "hl": 80, 9 | "n_fft": 640, 10 | "crop_start": 0, 11 | "crop_stop": 85, 12 | "lpf_start": 25, 13 | "lpf_stop": 53, 14 | "res_type": "polyphase" 15 | }, 16 | "2": { 17 | "sr": 7350, 18 | "hl": 80, 19 | "n_fft": 320, 20 | "crop_start": 4, 21 | "crop_stop": 87, 22 | "hpf_start": 25, 23 | "hpf_stop": 12, 24 | "lpf_start": 31, 25 | "lpf_stop": 62, 26 | "res_type": "polyphase" 27 | }, 28 | "3": { 29 | "sr": 14700, 30 | "hl": 160, 31 | "n_fft": 512, 32 | "crop_start": 17, 33 | "crop_stop": 216, 34 | "hpf_start": 48, 35 | "hpf_stop": 24, 36 | "lpf_start": 139, 37 | "lpf_stop": 210, 38 | "res_type": "polyphase" 39 | }, 40 | "4": { 41 | "sr": 44100, 42 | "hl": 480, 43 | "n_fft": 960, 44 | "crop_start": 78, 45 | "crop_stop": 383, 46 | "hpf_start": 130, 47 | "hpf_stop": 86, 48 | "convert_channels": "stereo_n", 49 | "res_type": "kaiser_fast" 50 | } 51 | }, 52 | "sr": 44100, 53 | "pre_filter_start": 668, 54 | "pre_filter_stop": 672 55 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/modelparams/ensemble.json: -------------------------------------------------------------------------------- 1 | { 2 | "mid_side_b2": true, 3 | "bins": 1280, 4 | "unstable_bins": 7, 5 | "reduction_bins": 565, 6 | "band": { 7 | "1": { 8 | "sr": 11025, 9 | "hl": 108, 10 | "n_fft": 2048, 11 | "crop_start": 0, 12 | "crop_stop": 374, 13 | "lpf_start": 92, 14 | "lpf_stop": 186, 15 | "res_type": "polyphase" 16 | }, 17 | "2": { 18 | "sr": 22050, 19 | "hl": 216, 20 | "n_fft": 1536, 21 | "crop_start": 0, 22 | "crop_stop": 424, 23 | "hpf_start": 68, 24 | "hpf_stop": 34, 25 | "lpf_start": 348, 26 | "lpf_stop": 418, 27 | "res_type": "polyphase" 28 | }, 29 | "3": { 30 | "sr": 44100, 31 | "hl": 432, 32 | "n_fft": 1280, 33 | "crop_start": 132, 34 | "crop_stop": 614, 35 | "hpf_start": 172, 36 | "hpf_stop": 144, 37 | "res_type": "polyphase" 38 | } 39 | }, 40 | "sr": 44100, 41 | "pre_filter_start": 1280, 42 | "pre_filter_stop": 1280 43 | } -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/nets.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | import torch.nn.functional as F 4 | 5 | from uvr5_pack.lib_v5 import layers 6 | from uvr5_pack.lib_v5 import spec_utils 7 | 8 | 9 | class BaseASPPNet(nn.Module): 10 | 11 | def __init__(self, nin, ch, dilations=(4, 8, 16)): 12 | super(BaseASPPNet, self).__init__() 13 | self.enc1 = layers.Encoder(nin, ch, 3, 2, 1) 14 | self.enc2 = layers.Encoder(ch, ch * 2, 3, 2, 1) 15 | self.enc3 = layers.Encoder(ch * 2, ch * 4, 3, 2, 1) 16 | self.enc4 = layers.Encoder(ch * 4, ch * 8, 3, 2, 1) 17 | 18 | self.aspp = layers.ASPPModule(ch * 8, ch * 16, dilations) 19 | 20 | self.dec4 = layers.Decoder(ch * (8 + 16), ch * 8, 3, 1, 1) 21 | self.dec3 = layers.Decoder(ch * (4 + 8), ch * 4, 3, 1, 1) 22 | self.dec2 = layers.Decoder(ch * (2 + 4), ch * 2, 3, 1, 1) 23 | self.dec1 = layers.Decoder(ch * (1 + 2), ch, 3, 1, 1) 24 | 25 | def __call__(self, x): 26 | h, e1 = self.enc1(x) 27 | h, e2 = self.enc2(h) 28 | h, e3 = self.enc3(h) 29 | h, e4 = self.enc4(h) 30 | 31 | h = self.aspp(h) 32 | 33 | h = self.dec4(h, e4) 34 | h = self.dec3(h, e3) 35 | h = self.dec2(h, e2) 36 | h = self.dec1(h, e1) 37 | 38 | return h 39 | 
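# BaseASPPNet above is a small U-Net: four strided Encoder stages, an
# ASPPModule bottleneck, and four Decoder stages fed by the encoder skip
# connections. CascadedASPPNet below chains three of them: stage 1 runs
# separate low-band and high-band nets on the two frequency halves of the
# input spectrogram, stages 2 and 3 refine the full band from the input
# concatenated with the earlier stage outputs, and the final sigmoid produces
# a magnitude mask that is multiplied back onto the detached mix. At inference
# the optional `aggressiveness` dict sharpens the mask by raising it to a
# power above and below `split_bin`, and predict() trims `offset` frames from
# both ends of the time axis.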
40 | 41 | class CascadedASPPNet(nn.Module): 42 | 43 | def __init__(self, n_fft): 44 | super(CascadedASPPNet, self).__init__() 45 | self.stg1_low_band_net = BaseASPPNet(2, 16) 46 | self.stg1_high_band_net = BaseASPPNet(2, 16) 47 | 48 | self.stg2_bridge = layers.Conv2DBNActiv(18, 8, 1, 1, 0) 49 | self.stg2_full_band_net = BaseASPPNet(8, 16) 50 | 51 | self.stg3_bridge = layers.Conv2DBNActiv(34, 16, 1, 1, 0) 52 | self.stg3_full_band_net = BaseASPPNet(16, 32) 53 | 54 | self.out = nn.Conv2d(32, 2, 1, bias=False) 55 | self.aux1_out = nn.Conv2d(16, 2, 1, bias=False) 56 | self.aux2_out = nn.Conv2d(16, 2, 1, bias=False) 57 | 58 | self.max_bin = n_fft // 2 59 | self.output_bin = n_fft // 2 + 1 60 | 61 | self.offset = 128 62 | 63 | def forward(self, x, aggressiveness=None): 64 | mix = x.detach() 65 | x = x.clone() 66 | 67 | x = x[:, :, :self.max_bin] 68 | 69 | bandw = x.size()[2] // 2 70 | aux1 = torch.cat([ 71 | self.stg1_low_band_net(x[:, :, :bandw]), 72 | self.stg1_high_band_net(x[:, :, bandw:]) 73 | ], dim=2) 74 | 75 | h = torch.cat([x, aux1], dim=1) 76 | aux2 = self.stg2_full_band_net(self.stg2_bridge(h)) 77 | 78 | h = torch.cat([x, aux1, aux2], dim=1) 79 | h = self.stg3_full_band_net(self.stg3_bridge(h)) 80 | 81 | mask = torch.sigmoid(self.out(h)) 82 | mask = F.pad( 83 | input=mask, 84 | pad=(0, 0, 0, self.output_bin - mask.size()[2]), 85 | mode='replicate') 86 | 87 | if self.training: 88 | aux1 = torch.sigmoid(self.aux1_out(aux1)) 89 | aux1 = F.pad( 90 | input=aux1, 91 | pad=(0, 0, 0, self.output_bin - aux1.size()[2]), 92 | mode='replicate') 93 | aux2 = torch.sigmoid(self.aux2_out(aux2)) 94 | aux2 = F.pad( 95 | input=aux2, 96 | pad=(0, 0, 0, self.output_bin - aux2.size()[2]), 97 | mode='replicate') 98 | return mask * mix, aux1 * mix, aux2 * mix 99 | else: 100 | if aggressiveness: 101 | mask[:, :, :aggressiveness['split_bin']] = torch.pow(mask[:, :, :aggressiveness['split_bin']], 1 + aggressiveness['value'] / 3) 102 | mask[:, :, aggressiveness['split_bin']:] = torch.pow(mask[:, :, aggressiveness['split_bin']:], 1 + aggressiveness['value']) 103 | 104 | return mask * mix 105 | 106 | def predict(self, x_mag, aggressiveness=None): 107 | h = self.forward(x_mag, aggressiveness) 108 | 109 | if self.offset > 0: 110 | h = h[:, :, :, self.offset:-self.offset] 111 | assert h.size()[3] > 0 112 | 113 | return h 114 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/nets_123812KB.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | import torch.nn.functional as F 4 | 5 | from uvr5_pack.lib_v5 import layers_123821KB as layers 6 | 7 | 8 | class BaseASPPNet(nn.Module): 9 | 10 | def __init__(self, nin, ch, dilations=(4, 8, 16)): 11 | super(BaseASPPNet, self).__init__() 12 | self.enc1 = layers.Encoder(nin, ch, 3, 2, 1) 13 | self.enc2 = layers.Encoder(ch, ch * 2, 3, 2, 1) 14 | self.enc3 = layers.Encoder(ch * 2, ch * 4, 3, 2, 1) 15 | self.enc4 = layers.Encoder(ch * 4, ch * 8, 3, 2, 1) 16 | 17 | self.aspp = layers.ASPPModule(ch * 8, ch * 16, dilations) 18 | 19 | self.dec4 = layers.Decoder(ch * (8 + 16), ch * 8, 3, 1, 1) 20 | self.dec3 = layers.Decoder(ch * (4 + 8), ch * 4, 3, 1, 1) 21 | self.dec2 = layers.Decoder(ch * (2 + 4), ch * 2, 3, 1, 1) 22 | self.dec1 = layers.Decoder(ch * (1 + 2), ch, 3, 1, 1) 23 | 24 | def __call__(self, x): 25 | h, e1 = self.enc1(x) 26 | h, e2 = self.enc2(h) 27 | h, e3 = self.enc3(h) 28 | h, e4 = self.enc4(h) 29 | 30 | h = self.aspp(h) 31 | 32 | h = self.dec4(h, 
e4) 33 | h = self.dec3(h, e3) 34 | h = self.dec2(h, e2) 35 | h = self.dec1(h, e1) 36 | 37 | return h 38 | 39 | 40 | class CascadedASPPNet(nn.Module): 41 | 42 | def __init__(self, n_fft): 43 | super(CascadedASPPNet, self).__init__() 44 | self.stg1_low_band_net = BaseASPPNet(2, 32) 45 | self.stg1_high_band_net = BaseASPPNet(2, 32) 46 | 47 | self.stg2_bridge = layers.Conv2DBNActiv(34, 16, 1, 1, 0) 48 | self.stg2_full_band_net = BaseASPPNet(16, 32) 49 | 50 | self.stg3_bridge = layers.Conv2DBNActiv(66, 32, 1, 1, 0) 51 | self.stg3_full_band_net = BaseASPPNet(32, 64) 52 | 53 | self.out = nn.Conv2d(64, 2, 1, bias=False) 54 | self.aux1_out = nn.Conv2d(32, 2, 1, bias=False) 55 | self.aux2_out = nn.Conv2d(32, 2, 1, bias=False) 56 | 57 | self.max_bin = n_fft // 2 58 | self.output_bin = n_fft // 2 + 1 59 | 60 | self.offset = 128 61 | 62 | def forward(self, x, aggressiveness=None): 63 | mix = x.detach() 64 | x = x.clone() 65 | 66 | x = x[:, :, :self.max_bin] 67 | 68 | bandw = x.size()[2] // 2 69 | aux1 = torch.cat([ 70 | self.stg1_low_band_net(x[:, :, :bandw]), 71 | self.stg1_high_band_net(x[:, :, bandw:]) 72 | ], dim=2) 73 | 74 | h = torch.cat([x, aux1], dim=1) 75 | aux2 = self.stg2_full_band_net(self.stg2_bridge(h)) 76 | 77 | h = torch.cat([x, aux1, aux2], dim=1) 78 | h = self.stg3_full_band_net(self.stg3_bridge(h)) 79 | 80 | mask = torch.sigmoid(self.out(h)) 81 | mask = F.pad( 82 | input=mask, 83 | pad=(0, 0, 0, self.output_bin - mask.size()[2]), 84 | mode='replicate') 85 | 86 | if self.training: 87 | aux1 = torch.sigmoid(self.aux1_out(aux1)) 88 | aux1 = F.pad( 89 | input=aux1, 90 | pad=(0, 0, 0, self.output_bin - aux1.size()[2]), 91 | mode='replicate') 92 | aux2 = torch.sigmoid(self.aux2_out(aux2)) 93 | aux2 = F.pad( 94 | input=aux2, 95 | pad=(0, 0, 0, self.output_bin - aux2.size()[2]), 96 | mode='replicate') 97 | return mask * mix, aux1 * mix, aux2 * mix 98 | else: 99 | if aggressiveness: 100 | mask[:, :, :aggressiveness['split_bin']] = torch.pow(mask[:, :, :aggressiveness['split_bin']], 1 + aggressiveness['value'] / 3) 101 | mask[:, :, aggressiveness['split_bin']:] = torch.pow(mask[:, :, aggressiveness['split_bin']:], 1 + aggressiveness['value']) 102 | 103 | return mask * mix 104 | 105 | def predict(self, x_mag, aggressiveness=None): 106 | h = self.forward(x_mag, aggressiveness) 107 | 108 | if self.offset > 0: 109 | h = h[:, :, :, self.offset:-self.offset] 110 | assert h.size()[3] > 0 111 | 112 | return h 113 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/nets_123821KB.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | import torch.nn.functional as F 4 | 5 | from uvr5_pack.lib_v5 import layers_123821KB as layers 6 | 7 | 8 | class BaseASPPNet(nn.Module): 9 | 10 | def __init__(self, nin, ch, dilations=(4, 8, 16)): 11 | super(BaseASPPNet, self).__init__() 12 | self.enc1 = layers.Encoder(nin, ch, 3, 2, 1) 13 | self.enc2 = layers.Encoder(ch, ch * 2, 3, 2, 1) 14 | self.enc3 = layers.Encoder(ch * 2, ch * 4, 3, 2, 1) 15 | self.enc4 = layers.Encoder(ch * 4, ch * 8, 3, 2, 1) 16 | 17 | self.aspp = layers.ASPPModule(ch * 8, ch * 16, dilations) 18 | 19 | self.dec4 = layers.Decoder(ch * (8 + 16), ch * 8, 3, 1, 1) 20 | self.dec3 = layers.Decoder(ch * (4 + 8), ch * 4, 3, 1, 1) 21 | self.dec2 = layers.Decoder(ch * (2 + 4), ch * 2, 3, 1, 1) 22 | self.dec1 = layers.Decoder(ch * (1 + 2), ch, 3, 1, 1) 23 | 24 | def __call__(self, x): 25 | h, e1 = self.enc1(x) 26 | h, e2 = self.enc2(h) 
27 | h, e3 = self.enc3(h) 28 | h, e4 = self.enc4(h) 29 | 30 | h = self.aspp(h) 31 | 32 | h = self.dec4(h, e4) 33 | h = self.dec3(h, e3) 34 | h = self.dec2(h, e2) 35 | h = self.dec1(h, e1) 36 | 37 | return h 38 | 39 | 40 | class CascadedASPPNet(nn.Module): 41 | 42 | def __init__(self, n_fft): 43 | super(CascadedASPPNet, self).__init__() 44 | self.stg1_low_band_net = BaseASPPNet(2, 32) 45 | self.stg1_high_band_net = BaseASPPNet(2, 32) 46 | 47 | self.stg2_bridge = layers.Conv2DBNActiv(34, 16, 1, 1, 0) 48 | self.stg2_full_band_net = BaseASPPNet(16, 32) 49 | 50 | self.stg3_bridge = layers.Conv2DBNActiv(66, 32, 1, 1, 0) 51 | self.stg3_full_band_net = BaseASPPNet(32, 64) 52 | 53 | self.out = nn.Conv2d(64, 2, 1, bias=False) 54 | self.aux1_out = nn.Conv2d(32, 2, 1, bias=False) 55 | self.aux2_out = nn.Conv2d(32, 2, 1, bias=False) 56 | 57 | self.max_bin = n_fft // 2 58 | self.output_bin = n_fft // 2 + 1 59 | 60 | self.offset = 128 61 | 62 | def forward(self, x, aggressiveness=None): 63 | mix = x.detach() 64 | x = x.clone() 65 | 66 | x = x[:, :, :self.max_bin] 67 | 68 | bandw = x.size()[2] // 2 69 | aux1 = torch.cat([ 70 | self.stg1_low_band_net(x[:, :, :bandw]), 71 | self.stg1_high_band_net(x[:, :, bandw:]) 72 | ], dim=2) 73 | 74 | h = torch.cat([x, aux1], dim=1) 75 | aux2 = self.stg2_full_band_net(self.stg2_bridge(h)) 76 | 77 | h = torch.cat([x, aux1, aux2], dim=1) 78 | h = self.stg3_full_band_net(self.stg3_bridge(h)) 79 | 80 | mask = torch.sigmoid(self.out(h)) 81 | mask = F.pad( 82 | input=mask, 83 | pad=(0, 0, 0, self.output_bin - mask.size()[2]), 84 | mode='replicate') 85 | 86 | if self.training: 87 | aux1 = torch.sigmoid(self.aux1_out(aux1)) 88 | aux1 = F.pad( 89 | input=aux1, 90 | pad=(0, 0, 0, self.output_bin - aux1.size()[2]), 91 | mode='replicate') 92 | aux2 = torch.sigmoid(self.aux2_out(aux2)) 93 | aux2 = F.pad( 94 | input=aux2, 95 | pad=(0, 0, 0, self.output_bin - aux2.size()[2]), 96 | mode='replicate') 97 | return mask * mix, aux1 * mix, aux2 * mix 98 | else: 99 | if aggressiveness: 100 | mask[:, :, :aggressiveness['split_bin']] = torch.pow(mask[:, :, :aggressiveness['split_bin']], 1 + aggressiveness['value'] / 3) 101 | mask[:, :, aggressiveness['split_bin']:] = torch.pow(mask[:, :, aggressiveness['split_bin']:], 1 + aggressiveness['value']) 102 | 103 | return mask * mix 104 | 105 | def predict(self, x_mag, aggressiveness=None): 106 | h = self.forward(x_mag, aggressiveness) 107 | 108 | if self.offset > 0: 109 | h = h[:, :, :, self.offset:-self.offset] 110 | assert h.size()[3] > 0 111 | 112 | return h 113 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/nets_33966KB.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | import torch.nn.functional as F 4 | 5 | from uvr5_pack.lib_v5 import layers_33966KB as layers 6 | 7 | 8 | class BaseASPPNet(nn.Module): 9 | 10 | def __init__(self, nin, ch, dilations=(4, 8, 16, 32)): 11 | super(BaseASPPNet, self).__init__() 12 | self.enc1 = layers.Encoder(nin, ch, 3, 2, 1) 13 | self.enc2 = layers.Encoder(ch, ch * 2, 3, 2, 1) 14 | self.enc3 = layers.Encoder(ch * 2, ch * 4, 3, 2, 1) 15 | self.enc4 = layers.Encoder(ch * 4, ch * 8, 3, 2, 1) 16 | 17 | self.aspp = layers.ASPPModule(ch * 8, ch * 16, dilations) 18 | 19 | self.dec4 = layers.Decoder(ch * (8 + 16), ch * 8, 3, 1, 1) 20 | self.dec3 = layers.Decoder(ch * (4 + 8), ch * 4, 3, 1, 1) 21 | self.dec2 = layers.Decoder(ch * (2 + 4), ch * 2, 3, 1, 1) 22 | self.dec1 = layers.Decoder(ch * 
(1 + 2), ch, 3, 1, 1) 23 | 24 | def __call__(self, x): 25 | h, e1 = self.enc1(x) 26 | h, e2 = self.enc2(h) 27 | h, e3 = self.enc3(h) 28 | h, e4 = self.enc4(h) 29 | 30 | h = self.aspp(h) 31 | 32 | h = self.dec4(h, e4) 33 | h = self.dec3(h, e3) 34 | h = self.dec2(h, e2) 35 | h = self.dec1(h, e1) 36 | 37 | return h 38 | 39 | 40 | class CascadedASPPNet(nn.Module): 41 | 42 | def __init__(self, n_fft): 43 | super(CascadedASPPNet, self).__init__() 44 | self.stg1_low_band_net = BaseASPPNet(2, 16) 45 | self.stg1_high_band_net = BaseASPPNet(2, 16) 46 | 47 | self.stg2_bridge = layers.Conv2DBNActiv(18, 8, 1, 1, 0) 48 | self.stg2_full_band_net = BaseASPPNet(8, 16) 49 | 50 | self.stg3_bridge = layers.Conv2DBNActiv(34, 16, 1, 1, 0) 51 | self.stg3_full_band_net = BaseASPPNet(16, 32) 52 | 53 | self.out = nn.Conv2d(32, 2, 1, bias=False) 54 | self.aux1_out = nn.Conv2d(16, 2, 1, bias=False) 55 | self.aux2_out = nn.Conv2d(16, 2, 1, bias=False) 56 | 57 | self.max_bin = n_fft // 2 58 | self.output_bin = n_fft // 2 + 1 59 | 60 | self.offset = 128 61 | 62 | def forward(self, x, aggressiveness=None): 63 | mix = x.detach() 64 | x = x.clone() 65 | 66 | x = x[:, :, :self.max_bin] 67 | 68 | bandw = x.size()[2] // 2 69 | aux1 = torch.cat([ 70 | self.stg1_low_band_net(x[:, :, :bandw]), 71 | self.stg1_high_band_net(x[:, :, bandw:]) 72 | ], dim=2) 73 | 74 | h = torch.cat([x, aux1], dim=1) 75 | aux2 = self.stg2_full_band_net(self.stg2_bridge(h)) 76 | 77 | h = torch.cat([x, aux1, aux2], dim=1) 78 | h = self.stg3_full_band_net(self.stg3_bridge(h)) 79 | 80 | mask = torch.sigmoid(self.out(h)) 81 | mask = F.pad( 82 | input=mask, 83 | pad=(0, 0, 0, self.output_bin - mask.size()[2]), 84 | mode='replicate') 85 | 86 | if self.training: 87 | aux1 = torch.sigmoid(self.aux1_out(aux1)) 88 | aux1 = F.pad( 89 | input=aux1, 90 | pad=(0, 0, 0, self.output_bin - aux1.size()[2]), 91 | mode='replicate') 92 | aux2 = torch.sigmoid(self.aux2_out(aux2)) 93 | aux2 = F.pad( 94 | input=aux2, 95 | pad=(0, 0, 0, self.output_bin - aux2.size()[2]), 96 | mode='replicate') 97 | return mask * mix, aux1 * mix, aux2 * mix 98 | else: 99 | if aggressiveness: 100 | mask[:, :, :aggressiveness['split_bin']] = torch.pow(mask[:, :, :aggressiveness['split_bin']], 1 + aggressiveness['value'] / 3) 101 | mask[:, :, aggressiveness['split_bin']:] = torch.pow(mask[:, :, aggressiveness['split_bin']:], 1 + aggressiveness['value']) 102 | 103 | return mask * mix 104 | 105 | def predict(self, x_mag, aggressiveness=None): 106 | h = self.forward(x_mag, aggressiveness) 107 | 108 | if self.offset > 0: 109 | h = h[:, :, :, self.offset:-self.offset] 110 | assert h.size()[3] > 0 111 | 112 | return h 113 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/nets_537227KB.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import numpy as np 3 | from torch import nn 4 | import torch.nn.functional as F 5 | 6 | from uvr5_pack.lib_v5 import layers_537238KB as layers 7 | 8 | 9 | class BaseASPPNet(nn.Module): 10 | 11 | def __init__(self, nin, ch, dilations=(4, 8, 16)): 12 | super(BaseASPPNet, self).__init__() 13 | self.enc1 = layers.Encoder(nin, ch, 3, 2, 1) 14 | self.enc2 = layers.Encoder(ch, ch * 2, 3, 2, 1) 15 | self.enc3 = layers.Encoder(ch * 2, ch * 4, 3, 2, 1) 16 | self.enc4 = layers.Encoder(ch * 4, ch * 8, 3, 2, 1) 17 | 18 | self.aspp = layers.ASPPModule(ch * 8, ch * 16, dilations) 19 | 20 | self.dec4 = layers.Decoder(ch * (8 + 16), ch * 8, 3, 1, 1) 21 | self.dec3 = layers.Decoder(ch * 
(4 + 8), ch * 4, 3, 1, 1) 22 | self.dec2 = layers.Decoder(ch * (2 + 4), ch * 2, 3, 1, 1) 23 | self.dec1 = layers.Decoder(ch * (1 + 2), ch, 3, 1, 1) 24 | 25 | def __call__(self, x): 26 | h, e1 = self.enc1(x) 27 | h, e2 = self.enc2(h) 28 | h, e3 = self.enc3(h) 29 | h, e4 = self.enc4(h) 30 | 31 | h = self.aspp(h) 32 | 33 | h = self.dec4(h, e4) 34 | h = self.dec3(h, e3) 35 | h = self.dec2(h, e2) 36 | h = self.dec1(h, e1) 37 | 38 | return h 39 | 40 | 41 | class CascadedASPPNet(nn.Module): 42 | 43 | def __init__(self, n_fft): 44 | super(CascadedASPPNet, self).__init__() 45 | self.stg1_low_band_net = BaseASPPNet(2, 64) 46 | self.stg1_high_band_net = BaseASPPNet(2, 64) 47 | 48 | self.stg2_bridge = layers.Conv2DBNActiv(66, 32, 1, 1, 0) 49 | self.stg2_full_band_net = BaseASPPNet(32, 64) 50 | 51 | self.stg3_bridge = layers.Conv2DBNActiv(130, 64, 1, 1, 0) 52 | self.stg3_full_band_net = BaseASPPNet(64, 128) 53 | 54 | self.out = nn.Conv2d(128, 2, 1, bias=False) 55 | self.aux1_out = nn.Conv2d(64, 2, 1, bias=False) 56 | self.aux2_out = nn.Conv2d(64, 2, 1, bias=False) 57 | 58 | self.max_bin = n_fft // 2 59 | self.output_bin = n_fft // 2 + 1 60 | 61 | self.offset = 128 62 | 63 | def forward(self, x, aggressiveness=None): 64 | mix = x.detach() 65 | x = x.clone() 66 | 67 | x = x[:, :, :self.max_bin] 68 | 69 | bandw = x.size()[2] // 2 70 | aux1 = torch.cat([ 71 | self.stg1_low_band_net(x[:, :, :bandw]), 72 | self.stg1_high_band_net(x[:, :, bandw:]) 73 | ], dim=2) 74 | 75 | h = torch.cat([x, aux1], dim=1) 76 | aux2 = self.stg2_full_band_net(self.stg2_bridge(h)) 77 | 78 | h = torch.cat([x, aux1, aux2], dim=1) 79 | h = self.stg3_full_band_net(self.stg3_bridge(h)) 80 | 81 | mask = torch.sigmoid(self.out(h)) 82 | mask = F.pad( 83 | input=mask, 84 | pad=(0, 0, 0, self.output_bin - mask.size()[2]), 85 | mode='replicate') 86 | 87 | if self.training: 88 | aux1 = torch.sigmoid(self.aux1_out(aux1)) 89 | aux1 = F.pad( 90 | input=aux1, 91 | pad=(0, 0, 0, self.output_bin - aux1.size()[2]), 92 | mode='replicate') 93 | aux2 = torch.sigmoid(self.aux2_out(aux2)) 94 | aux2 = F.pad( 95 | input=aux2, 96 | pad=(0, 0, 0, self.output_bin - aux2.size()[2]), 97 | mode='replicate') 98 | return mask * mix, aux1 * mix, aux2 * mix 99 | else: 100 | if aggressiveness: 101 | mask[:, :, :aggressiveness['split_bin']] = torch.pow(mask[:, :, :aggressiveness['split_bin']], 1 + aggressiveness['value'] / 3) 102 | mask[:, :, aggressiveness['split_bin']:] = torch.pow(mask[:, :, aggressiveness['split_bin']:], 1 + aggressiveness['value']) 103 | 104 | return mask * mix 105 | 106 | def predict(self, x_mag, aggressiveness=None): 107 | h = self.forward(x_mag, aggressiveness) 108 | 109 | if self.offset > 0: 110 | h = h[:, :, :, self.offset:-self.offset] 111 | assert h.size()[3] > 0 112 | 113 | return h 114 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/nets_537238KB.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import numpy as np 3 | from torch import nn 4 | import torch.nn.functional as F 5 | 6 | from uvr5_pack.lib_v5 import layers_537238KB as layers 7 | 8 | 9 | class BaseASPPNet(nn.Module): 10 | 11 | def __init__(self, nin, ch, dilations=(4, 8, 16)): 12 | super(BaseASPPNet, self).__init__() 13 | self.enc1 = layers.Encoder(nin, ch, 3, 2, 1) 14 | self.enc2 = layers.Encoder(ch, ch * 2, 3, 2, 1) 15 | self.enc3 = layers.Encoder(ch * 2, ch * 4, 3, 2, 1) 16 | self.enc4 = layers.Encoder(ch * 4, ch * 8, 3, 2, 1) 17 | 18 | self.aspp = 
layers.ASPPModule(ch * 8, ch * 16, dilations) 19 | 20 | self.dec4 = layers.Decoder(ch * (8 + 16), ch * 8, 3, 1, 1) 21 | self.dec3 = layers.Decoder(ch * (4 + 8), ch * 4, 3, 1, 1) 22 | self.dec2 = layers.Decoder(ch * (2 + 4), ch * 2, 3, 1, 1) 23 | self.dec1 = layers.Decoder(ch * (1 + 2), ch, 3, 1, 1) 24 | 25 | def __call__(self, x): 26 | h, e1 = self.enc1(x) 27 | h, e2 = self.enc2(h) 28 | h, e3 = self.enc3(h) 29 | h, e4 = self.enc4(h) 30 | 31 | h = self.aspp(h) 32 | 33 | h = self.dec4(h, e4) 34 | h = self.dec3(h, e3) 35 | h = self.dec2(h, e2) 36 | h = self.dec1(h, e1) 37 | 38 | return h 39 | 40 | 41 | class CascadedASPPNet(nn.Module): 42 | 43 | def __init__(self, n_fft): 44 | super(CascadedASPPNet, self).__init__() 45 | self.stg1_low_band_net = BaseASPPNet(2, 64) 46 | self.stg1_high_band_net = BaseASPPNet(2, 64) 47 | 48 | self.stg2_bridge = layers.Conv2DBNActiv(66, 32, 1, 1, 0) 49 | self.stg2_full_band_net = BaseASPPNet(32, 64) 50 | 51 | self.stg3_bridge = layers.Conv2DBNActiv(130, 64, 1, 1, 0) 52 | self.stg3_full_band_net = BaseASPPNet(64, 128) 53 | 54 | self.out = nn.Conv2d(128, 2, 1, bias=False) 55 | self.aux1_out = nn.Conv2d(64, 2, 1, bias=False) 56 | self.aux2_out = nn.Conv2d(64, 2, 1, bias=False) 57 | 58 | self.max_bin = n_fft // 2 59 | self.output_bin = n_fft // 2 + 1 60 | 61 | self.offset = 128 62 | 63 | def forward(self, x, aggressiveness=None): 64 | mix = x.detach() 65 | x = x.clone() 66 | 67 | x = x[:, :, :self.max_bin] 68 | 69 | bandw = x.size()[2] // 2 70 | aux1 = torch.cat([ 71 | self.stg1_low_band_net(x[:, :, :bandw]), 72 | self.stg1_high_band_net(x[:, :, bandw:]) 73 | ], dim=2) 74 | 75 | h = torch.cat([x, aux1], dim=1) 76 | aux2 = self.stg2_full_band_net(self.stg2_bridge(h)) 77 | 78 | h = torch.cat([x, aux1, aux2], dim=1) 79 | h = self.stg3_full_band_net(self.stg3_bridge(h)) 80 | 81 | mask = torch.sigmoid(self.out(h)) 82 | mask = F.pad( 83 | input=mask, 84 | pad=(0, 0, 0, self.output_bin - mask.size()[2]), 85 | mode='replicate') 86 | 87 | if self.training: 88 | aux1 = torch.sigmoid(self.aux1_out(aux1)) 89 | aux1 = F.pad( 90 | input=aux1, 91 | pad=(0, 0, 0, self.output_bin - aux1.size()[2]), 92 | mode='replicate') 93 | aux2 = torch.sigmoid(self.aux2_out(aux2)) 94 | aux2 = F.pad( 95 | input=aux2, 96 | pad=(0, 0, 0, self.output_bin - aux2.size()[2]), 97 | mode='replicate') 98 | return mask * mix, aux1 * mix, aux2 * mix 99 | else: 100 | if aggressiveness: 101 | mask[:, :, :aggressiveness['split_bin']] = torch.pow(mask[:, :, :aggressiveness['split_bin']], 1 + aggressiveness['value'] / 3) 102 | mask[:, :, aggressiveness['split_bin']:] = torch.pow(mask[:, :, aggressiveness['split_bin']:], 1 + aggressiveness['value']) 103 | 104 | return mask * mix 105 | 106 | def predict(self, x_mag, aggressiveness=None): 107 | h = self.forward(x_mag, aggressiveness) 108 | 109 | if self.offset > 0: 110 | h = h[:, :, :, self.offset:-self.offset] 111 | assert h.size()[3] > 0 112 | 113 | return h 114 | -------------------------------------------------------------------------------- /uvr5_pack/lib_v5/nets_61968KB.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | import torch.nn.functional as F 4 | 5 | from uvr5_pack.lib_v5 import layers_123821KB as layers 6 | 7 | 8 | class BaseASPPNet(nn.Module): 9 | 10 | def __init__(self, nin, ch, dilations=(4, 8, 16)): 11 | super(BaseASPPNet, self).__init__() 12 | self.enc1 = layers.Encoder(nin, ch, 3, 2, 1) 13 | self.enc2 = layers.Encoder(ch, ch * 2, 3, 2, 1) 14 | self.enc3 = 
layers.Encoder(ch * 2, ch * 4, 3, 2, 1) 15 | self.enc4 = layers.Encoder(ch * 4, ch * 8, 3, 2, 1) 16 | 17 | self.aspp = layers.ASPPModule(ch * 8, ch * 16, dilations) 18 | 19 | self.dec4 = layers.Decoder(ch * (8 + 16), ch * 8, 3, 1, 1) 20 | self.dec3 = layers.Decoder(ch * (4 + 8), ch * 4, 3, 1, 1) 21 | self.dec2 = layers.Decoder(ch * (2 + 4), ch * 2, 3, 1, 1) 22 | self.dec1 = layers.Decoder(ch * (1 + 2), ch, 3, 1, 1) 23 | 24 | def __call__(self, x): 25 | h, e1 = self.enc1(x) 26 | h, e2 = self.enc2(h) 27 | h, e3 = self.enc3(h) 28 | h, e4 = self.enc4(h) 29 | 30 | h = self.aspp(h) 31 | 32 | h = self.dec4(h, e4) 33 | h = self.dec3(h, e3) 34 | h = self.dec2(h, e2) 35 | h = self.dec1(h, e1) 36 | 37 | return h 38 | 39 | 40 | class CascadedASPPNet(nn.Module): 41 | 42 | def __init__(self, n_fft): 43 | super(CascadedASPPNet, self).__init__() 44 | self.stg1_low_band_net = BaseASPPNet(2, 32) 45 | self.stg1_high_band_net = BaseASPPNet(2, 32) 46 | 47 | self.stg2_bridge = layers.Conv2DBNActiv(34, 16, 1, 1, 0) 48 | self.stg2_full_band_net = BaseASPPNet(16, 32) 49 | 50 | self.stg3_bridge = layers.Conv2DBNActiv(66, 32, 1, 1, 0) 51 | self.stg3_full_band_net = BaseASPPNet(32, 64) 52 | 53 | self.out = nn.Conv2d(64, 2, 1, bias=False) 54 | self.aux1_out = nn.Conv2d(32, 2, 1, bias=False) 55 | self.aux2_out = nn.Conv2d(32, 2, 1, bias=False) 56 | 57 | self.max_bin = n_fft // 2 58 | self.output_bin = n_fft // 2 + 1 59 | 60 | self.offset = 128 61 | 62 | def forward(self, x, aggressiveness=None): 63 | mix = x.detach() 64 | x = x.clone() 65 | 66 | x = x[:, :, :self.max_bin] 67 | 68 | bandw = x.size()[2] // 2 69 | aux1 = torch.cat([ 70 | self.stg1_low_band_net(x[:, :, :bandw]), 71 | self.stg1_high_band_net(x[:, :, bandw:]) 72 | ], dim=2) 73 | 74 | h = torch.cat([x, aux1], dim=1) 75 | aux2 = self.stg2_full_band_net(self.stg2_bridge(h)) 76 | 77 | h = torch.cat([x, aux1, aux2], dim=1) 78 | h = self.stg3_full_band_net(self.stg3_bridge(h)) 79 | 80 | mask = torch.sigmoid(self.out(h)) 81 | mask = F.pad( 82 | input=mask, 83 | pad=(0, 0, 0, self.output_bin - mask.size()[2]), 84 | mode='replicate') 85 | 86 | if self.training: 87 | aux1 = torch.sigmoid(self.aux1_out(aux1)) 88 | aux1 = F.pad( 89 | input=aux1, 90 | pad=(0, 0, 0, self.output_bin - aux1.size()[2]), 91 | mode='replicate') 92 | aux2 = torch.sigmoid(self.aux2_out(aux2)) 93 | aux2 = F.pad( 94 | input=aux2, 95 | pad=(0, 0, 0, self.output_bin - aux2.size()[2]), 96 | mode='replicate') 97 | return mask * mix, aux1 * mix, aux2 * mix 98 | else: 99 | if aggressiveness: 100 | mask[:, :, :aggressiveness['split_bin']] = torch.pow(mask[:, :, :aggressiveness['split_bin']], 1 + aggressiveness['value'] / 3) 101 | mask[:, :, aggressiveness['split_bin']:] = torch.pow(mask[:, :, aggressiveness['split_bin']:], 1 + aggressiveness['value']) 102 | 103 | return mask * mix 104 | 105 | def predict(self, x_mag, aggressiveness=None): 106 | h = self.forward(x_mag, aggressiveness) 107 | 108 | if self.offset > 0: 109 | h = h[:, :, :, self.offset:-self.offset] 110 | assert h.size()[3] > 0 111 | 112 | return h 113 | -------------------------------------------------------------------------------- /uvr5_pack/utils.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import numpy as np 3 | from tqdm import tqdm 4 | 5 | def make_padding(width, cropsize, offset): 6 | left = offset 7 | roi_size = cropsize - left * 2 8 | if roi_size == 0: 9 | roi_size = cropsize 10 | right = roi_size - (width % roi_size) + left 11 | 12 | return left, right, roi_size 13 | def 
inference(X_spec, device, model, aggressiveness,data): 14 | ''' 15 | data : dic configs 16 | ''' 17 | 18 | def _execute(X_mag_pad, roi_size, n_window, device, model, aggressiveness,is_half=True): 19 | model.eval() 20 | with torch.no_grad(): 21 | preds = [] 22 | 23 | iterations = [n_window] 24 | 25 | total_iterations = sum(iterations) 26 | for i in tqdm(range(n_window)): 27 | start = i * roi_size 28 | X_mag_window = X_mag_pad[None, :, :, start:start + data['window_size']] 29 | X_mag_window = torch.from_numpy(X_mag_window) 30 | if(is_half==True):X_mag_window=X_mag_window.half() 31 | X_mag_window=X_mag_window.to(device) 32 | 33 | pred = model.predict(X_mag_window, aggressiveness) 34 | 35 | pred = pred.detach().cpu().numpy() 36 | preds.append(pred[0]) 37 | 38 | pred = np.concatenate(preds, axis=2) 39 | return pred 40 | 41 | def preprocess(X_spec): 42 | X_mag = np.abs(X_spec) 43 | X_phase = np.angle(X_spec) 44 | 45 | return X_mag, X_phase 46 | 47 | X_mag, X_phase = preprocess(X_spec) 48 | 49 | coef = X_mag.max() 50 | X_mag_pre = X_mag / coef 51 | 52 | n_frame = X_mag_pre.shape[2] 53 | pad_l, pad_r, roi_size = make_padding(n_frame, 54 | data['window_size'], model.offset) 55 | n_window = int(np.ceil(n_frame / roi_size)) 56 | 57 | X_mag_pad = np.pad( 58 | X_mag_pre, ((0, 0), (0, 0), (pad_l, pad_r)), mode='constant') 59 | 60 | if(list(model.state_dict().values())[0].dtype==torch.float16):is_half=True 61 | else:is_half=False 62 | pred = _execute(X_mag_pad, roi_size, n_window, 63 | device, model, aggressiveness,is_half) 64 | pred = pred[:, :, :n_frame] 65 | 66 | if data['tta']: 67 | pad_l += roi_size // 2 68 | pad_r += roi_size // 2 69 | n_window += 1 70 | 71 | X_mag_pad = np.pad( 72 | X_mag_pre, ((0, 0), (0, 0), (pad_l, pad_r)), mode='constant') 73 | 74 | pred_tta = _execute(X_mag_pad, roi_size, n_window, 75 | device, model, aggressiveness,is_half) 76 | pred_tta = pred_tta[:, :, roi_size // 2:] 77 | pred_tta = pred_tta[:, :, :n_frame] 78 | 79 | return (pred + pred_tta) * 0.5 * coef, X_mag, np.exp(1.j * X_phase) 80 | else: 81 | return pred * coef, X_mag, np.exp(1.j * X_phase) 82 | 83 | 84 | 85 | def _get_name_params(model_path , model_hash): 86 | ModelName = model_path 87 | if model_hash == '47939caf0cfe52a0e81442b85b971dfd': 88 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100.json') 89 | param_name_auto=str('4band_44100') 90 | if model_hash == '4e4ecb9764c50a8c414fee6e10395bbe': 91 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_v2.json') 92 | param_name_auto=str('4band_v2') 93 | if model_hash == 'ca106edd563e034bde0bdec4bb7a4b36': 94 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_v2.json') 95 | param_name_auto=str('4band_v2') 96 | if model_hash == 'e60a1e84803ce4efc0a6551206cc4b71': 97 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100.json') 98 | param_name_auto=str('4band_44100') 99 | if model_hash == 'a82f14e75892e55e994376edbf0c8435': 100 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100.json') 101 | param_name_auto=str('4band_44100') 102 | if model_hash == '6dd9eaa6f0420af9f1d403aaafa4cc06': 103 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_v2_sn.json') 104 | param_name_auto=str('4band_v2_sn') 105 | if model_hash == '08611fb99bd59eaa79ad27c58d137727': 106 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_v2_sn.json') 107 | param_name_auto=str('4band_v2_sn') 108 | if model_hash == '5c7bbca45a187e81abbbd351606164e5': 109 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/3band_44100_msb2.json') 
110 | param_name_auto=str('3band_44100_msb2') 111 | if model_hash == 'd6b2cb685a058a091e5e7098192d3233': 112 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/3band_44100_msb2.json') 113 | param_name_auto=str('3band_44100_msb2') 114 | if model_hash == 'c1b9f38170a7c90e96f027992eb7c62b': 115 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100.json') 116 | param_name_auto=str('4band_44100') 117 | if model_hash == 'c3448ec923fa0edf3d03a19e633faa53': 118 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100.json') 119 | param_name_auto=str('4band_44100') 120 | if model_hash == '68aa2c8093d0080704b200d140f59e54': 121 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/3band_44100.json') 122 | param_name_auto=str('3band_44100.json') 123 | if model_hash == 'fdc83be5b798e4bd29fe00fe6600e147': 124 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/3band_44100_mid.json') 125 | param_name_auto=str('3band_44100_mid.json') 126 | if model_hash == '2ce34bc92fd57f55db16b7a4def3d745': 127 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/3band_44100_mid.json') 128 | param_name_auto=str('3band_44100_mid.json') 129 | if model_hash == '52fdca89576f06cf4340b74a4730ee5f': 130 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100.json') 131 | param_name_auto=str('4band_44100.json') 132 | if model_hash == '41191165b05d38fc77f072fa9e8e8a30': 133 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100.json') 134 | param_name_auto=str('4band_44100.json') 135 | if model_hash == '89e83b511ad474592689e562d5b1f80e': 136 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/2band_32000.json') 137 | param_name_auto=str('2band_32000.json') 138 | if model_hash == '0b954da81d453b716b114d6d7c95177f': 139 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/2band_32000.json') 140 | param_name_auto=str('2band_32000.json') 141 | 142 | #v4 Models 143 | if model_hash == '6a00461c51c2920fd68937d4609ed6c8': 144 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr16000_hl512.json') 145 | param_name_auto=str('1band_sr16000_hl512') 146 | if model_hash == '0ab504864d20f1bd378fe9c81ef37140': 147 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr32000_hl512.json') 148 | param_name_auto=str('1band_sr32000_hl512') 149 | if model_hash == '7dd21065bf91c10f7fccb57d7d83b07f': 150 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr32000_hl512.json') 151 | param_name_auto=str('1band_sr32000_hl512') 152 | if model_hash == '80ab74d65e515caa3622728d2de07d23': 153 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr32000_hl512.json') 154 | param_name_auto=str('1band_sr32000_hl512') 155 | if model_hash == 'edc115e7fc523245062200c00caa847f': 156 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr33075_hl384.json') 157 | param_name_auto=str('1band_sr33075_hl384') 158 | if model_hash == '28063e9f6ab5b341c5f6d3c67f2045b7': 159 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr33075_hl384.json') 160 | param_name_auto=str('1band_sr33075_hl384') 161 | if model_hash == 'b58090534c52cbc3e9b5104bad666ef2': 162 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr44100_hl512.json') 163 | param_name_auto=str('1band_sr44100_hl512') 164 | if model_hash == '0cdab9947f1b0928705f518f3c78ea8f': 165 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr44100_hl512.json') 166 | param_name_auto=str('1band_sr44100_hl512') 167 | if model_hash == 'ae702fed0238afb5346db8356fe25f13': 168 | 
model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr44100_hl1024.json') 169 | param_name_auto=str('1band_sr44100_hl1024') 170 | #User Models 171 | 172 | #1 Band 173 | if '1band_sr16000_hl512' in ModelName: 174 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr16000_hl512.json') 175 | param_name_auto=str('1band_sr16000_hl512') 176 | if '1band_sr32000_hl512' in ModelName: 177 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr32000_hl512.json') 178 | param_name_auto=str('1band_sr32000_hl512') 179 | if '1band_sr33075_hl384' in ModelName: 180 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr33075_hl384.json') 181 | param_name_auto=str('1band_sr33075_hl384') 182 | if '1band_sr44100_hl256' in ModelName: 183 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr44100_hl256.json') 184 | param_name_auto=str('1band_sr44100_hl256') 185 | if '1band_sr44100_hl512' in ModelName: 186 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr44100_hl512.json') 187 | param_name_auto=str('1band_sr44100_hl512') 188 | if '1band_sr44100_hl1024' in ModelName: 189 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/1band_sr44100_hl1024.json') 190 | param_name_auto=str('1band_sr44100_hl1024') 191 | 192 | #2 Band 193 | if '2band_44100_lofi' in ModelName: 194 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/2band_44100_lofi.json') 195 | param_name_auto=str('2band_44100_lofi') 196 | if '2band_32000' in ModelName: 197 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/2band_32000.json') 198 | param_name_auto=str('2band_32000') 199 | if '2band_48000' in ModelName: 200 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/2band_48000.json') 201 | param_name_auto=str('2band_48000') 202 | 203 | #3 Band 204 | if '3band_44100' in ModelName: 205 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/3band_44100.json') 206 | param_name_auto=str('3band_44100') 207 | if '3band_44100_mid' in ModelName: 208 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/3band_44100_mid.json') 209 | param_name_auto=str('3band_44100_mid') 210 | if '3band_44100_msb2' in ModelName: 211 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/3band_44100_msb2.json') 212 | param_name_auto=str('3band_44100_msb2') 213 | 214 | #4 Band 215 | if '4band_44100' in ModelName: 216 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100.json') 217 | param_name_auto=str('4band_44100') 218 | if '4band_44100_mid' in ModelName: 219 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100_mid.json') 220 | param_name_auto=str('4band_44100_mid') 221 | if '4band_44100_msb' in ModelName: 222 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100_msb.json') 223 | param_name_auto=str('4band_44100_msb') 224 | if '4band_44100_msb2' in ModelName: 225 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100_msb2.json') 226 | param_name_auto=str('4band_44100_msb2') 227 | if '4band_44100_reverse' in ModelName: 228 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100_reverse.json') 229 | param_name_auto=str('4band_44100_reverse') 230 | if '4band_44100_sw' in ModelName: 231 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_44100_sw.json') 232 | param_name_auto=str('4band_44100_sw') 233 | if '4band_v2' in ModelName: 234 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_v2.json') 235 | param_name_auto=str('4band_v2') 236 | if '4band_v2_sn' in ModelName: 237 | model_params_auto=str('uvr5_pack/lib_v5/modelparams/4band_v2_sn.json') 238 | 
        param_name_auto=str('4band_v2_sn')
239 |     if 'tmodelparam' in ModelName:
240 |         model_params_auto=str('uvr5_pack/lib_v5/modelparams/tmodelparam.json')
241 |         param_name_auto=str('User Model Param Set')
242 |     return param_name_auto , model_params_auto
243 |
--------------------------------------------------------------------------------
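The helpers above (make_padding, inference, _get_name_params) are what infer_uvr5.py builds on for vocal/instrumental separation: a band-split parameter set is resolved from the checkpoint's hash or filename, a CascadedASPPNet from one of the nets_*KB.py variants is constructed, and inference slides a window over the magnitude spectrogram and returns the masked result together with magnitude and phase. A minimal sketch of that wiring follows; the checkpoint name, the placeholder hash, the n_fft value, the random spectrogram and the inference settings are illustrative assumptions, not values taken from this repository.

import hashlib
import numpy as np
import torch

from uvr5_pack.lib_v5 import nets_61968KB as nets
from uvr5_pack.utils import _get_name_params, inference

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical checkpoint name; the "4band_v2" substring lets _get_name_params
# resolve a parameter set even without a recognized hash.
model_path = "uvr5_weights/HP2_4band_v2.pth"
model_hash = "0" * 32  # placeholder; infer_uvr5.py derives the real hash from the checkpoint bytes
param_name, params_json = _get_name_params(model_path, model_hash)
print(param_name, params_json)  # -> 4band_v2 uvr5_pack/lib_v5/modelparams/4band_v2.json

# n_fft = 2048 is an assumed value; the real value comes from the parameter JSON.
# A trained run would also call model.load_state_dict(torch.load(model_path, map_location="cpu")).
model = nets.CascadedASPPNet(2048).to(device).eval()

# Stand-in for the complex spectrogram of a stereo mix: (channels, n_fft // 2 + 1, frames).
X_spec = (np.random.randn(2, 1025, 1024) + 1j * np.random.randn(2, 1025, 1024)).astype(np.complex64)

data = {"window_size": 512, "tta": False}          # assumed inference settings
aggressiveness = {"value": 0.1, "split_bin": 128}  # assumed aggressiveness settings
pred, X_mag, phase = inference(X_spec, device, model, aggressiveness, data)
instrumental_spec = pred * phase  # recombine the masked magnitude with the original phase

In the actual WebUI flow the multi-band input spectrogram is assembled through spec_utils according to the parameter JSON rather than the single random array used here.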
/uvr5_weights/.gitignore:
--------------------------------------------------------------------------------
1 | *
2 | !.gitignore
3 |
--------------------------------------------------------------------------------
/vc_infer_pipeline.py:
--------------------------------------------------------------------------------
1 | import numpy as np,parselmouth,torch,pdb
2 | from time import time as ttime
3 | import torch.nn.functional as F
4 | from config import x_pad,x_query,x_center,x_max
5 | import scipy.signal as signal
6 | import pyworld,os,traceback,faiss
7 | class VC(object):
8 |     def __init__(self,tgt_sr,device,is_half):
9 |         self.sr=16000  # hubert input sampling rate
10 |         self.window=160  # samples per frame
11 |         self.t_pad=self.sr*x_pad  # padding time added before and after each clip
12 |         self.t_pad_tgt=tgt_sr*x_pad
13 |         self.t_pad2=self.t_pad*2
14 |         self.t_query=self.sr*x_query  # search range around each candidate cut point
15 |         self.t_center=self.sr*x_center  # spacing of candidate cut points
16 |         self.t_max=self.sr*x_max  # duration threshold below which no cut-point search is done
17 |         self.device=device
18 |         self.is_half=is_half
19 |
20 |     def get_f0(self,x, p_len,f0_up_key,f0_method,inp_f0=None):
21 |         time_step = self.window / self.sr * 1000
22 |         f0_min = 50
23 |         f0_max = 1100
24 |         f0_mel_min = 1127 * np.log(1 + f0_min / 700)
25 |         f0_mel_max = 1127 * np.log(1 + f0_max / 700)
26 |         if(f0_method=="pm"):
27 |             f0 = parselmouth.Sound(x, self.sr).to_pitch_ac(
28 |                 time_step=time_step / 1000, voicing_threshold=0.6,
29 |                 pitch_floor=f0_min, pitch_ceiling=f0_max).selected_array['frequency']
30 |             pad_size=(p_len - len(f0) + 1) // 2
31 |             if(pad_size>0 or p_len - len(f0) - pad_size>0):
32 |                 f0 = np.pad(f0,[[pad_size,p_len - len(f0) - pad_size]], mode='constant')
33 |         elif(f0_method=="harvest"):
34 |             f0, t = pyworld.harvest(
35 |                 x.astype(np.double),
36 |                 fs=self.sr,
37 |                 f0_ceil=f0_max,
38 |                 frame_period=10,
39 |             )
40 |             f0 = pyworld.stonemask(x.astype(np.double), f0, t, self.sr)
41 |             f0 = signal.medfilt(f0, 3)
42 |         f0 *= pow(2, f0_up_key / 12)
43 |         # with open("test.txt","w")as f:f.write("\n".join([str(i)for i in f0.tolist()]))
44 |         tf0=self.sr//self.window  # f0 points per second
45 |         if (inp_f0 is not None):
46 |             delta_t=np.round((inp_f0[:,0].max()-inp_f0[:,0].min())*tf0+1).astype("int16")
47 |             replace_f0=np.interp(list(range(delta_t)), inp_f0[:, 0]*100, inp_f0[:, 1])
48 |             shape=f0[x_pad*tf0:x_pad*tf0+len(replace_f0)].shape[0]
49 |             f0[x_pad*tf0:x_pad*tf0+len(replace_f0)]=replace_f0[:shape]
50 |         # with open("test_opt.txt","w")as f:f.write("\n".join([str(i)for i in f0.tolist()]))
51 |         f0bak = f0.copy()
52 |         f0_mel = 1127 * np.log(1 + f0 / 700)
53 |         f0_mel[f0_mel > 0] = (f0_mel[f0_mel > 0] - f0_mel_min) * 254 / (f0_mel_max - f0_mel_min) + 1
54 |         f0_mel[f0_mel <= 1] = 1
55 |         f0_mel[f0_mel > 255] = 255
56 |         f0_coarse = np.rint(f0_mel).astype(int)  # builtin int: np.int is removed in recent numpy
57 |         return f0_coarse, f0bak  # 1-0
58 |
59 |     def vc(self,model,net_g,sid,audio0,pitch,pitchf,times,index,big_npy,index_rate):  # ,file_index,file_big_npy
60 |         feats = torch.from_numpy(audio0)
61 |         if(self.is_half==True):feats=feats.half()
62 |         else:feats=feats.float()
63 |         if feats.dim() == 2:  # double channels
64 |             feats = feats.mean(-1)
65 |         assert feats.dim() == 1, feats.dim()
66 |         feats = feats.view(1, -1)
67 |         padding_mask = torch.BoolTensor(feats.shape).to(self.device).fill_(False)
68 |
69 |         inputs = {
70 |             "source": feats.to(self.device),
71 |             "padding_mask": padding_mask,
72 |             "output_layer": 9,  # layer 9
73 |         }
74 |         t0 = ttime()
75 |         with torch.no_grad():
76 |             logits = model.extract_features(**inputs)
77 |             feats = model.final_proj(logits[0])
78 |
79 |         if(isinstance(index,type(None))==False and isinstance(big_npy,type(None))==False and index_rate!=0):
80 |             npy = feats[0].cpu().numpy()
81 |             if(self.is_half==True):npy=npy.astype("float32")
82 |             _, I = index.search(npy, 1)
83 |             npy=big_npy[I.squeeze()]
84 |             if(self.is_half==True):npy=npy.astype("float16")
85 |             feats = torch.from_numpy(npy).unsqueeze(0).to(self.device)*index_rate + (1-index_rate)*feats
86 |
87 |         feats = F.interpolate(feats.permute(0, 2, 1), scale_factor=2).permute(0, 2, 1)
88 |         t1 = ttime()
89 |         p_len = audio0.shape[0]//self.window
90 |         if(feats.shape[1]self.t_max):
121 |             audio_sum = np.zeros_like(audio)
122 |             for i in range(self.window): audio_sum += audio_pad[i:i - self.window]
123 |             for t in range(self.t_center, audio.shape[0],self.t_center):opt_ts.append(t - self.t_query + np.where(np.abs(audio_sum[t - self.t_query:t + self.t_query]) == np.abs(audio_sum[t - self.t_query:t + self.t_query]).min())[0][0])
124 |         s = 0
125 |         audio_opt=[]
126 |         t=None
127 |         t1=ttime()
128 |         audio_pad = np.pad(audio, (self.t_pad, self.t_pad), mode='reflect')
129 |         p_len=audio_pad.shape[0]//self.window
130 |         inp_f0=None
131 |         if(hasattr(f0_file,'name') ==True):
132 |             try:
133 |                 with open(f0_file.name,"r")as f:
134 |                     lines=f.read().strip("\n").split("\n")
135 |                 inp_f0=[]
136 |                 for line in lines:inp_f0.append([float(i)for i in line.split(",")])
137 |                 inp_f0=np.array(inp_f0,dtype="float32")
138 |             except:
139 |                 traceback.print_exc()
140 |         sid=torch.tensor(sid,device=self.device).unsqueeze(0).long()
141 |         pitch, pitchf=None,None
142 |         if(if_f0==1):
143 |             pitch, pitchf = self.get_f0(audio_pad, p_len, f0_up_key,f0_method,inp_f0)
144 |             pitch = pitch[:p_len]
145 |             pitchf = pitchf[:p_len]
146 |             pitch = torch.tensor(pitch,device=self.device).unsqueeze(0).long()
147 |             pitchf = torch.tensor(pitchf,device=self.device).unsqueeze(0).float()
148 |         t2=ttime()
149 |         times[1] += (t2 - t1)
150 |         for t in opt_ts:
151 |             t=t//self.window*self.window
152 |             if (if_f0 == 1):
153 |                 audio_opt.append(self.vc(model,net_g,sid,audio_pad[s:t+self.t_pad2+self.window],pitch[:,s//self.window:(t+self.t_pad2)//self.window],pitchf[:,s//self.window:(t+self.t_pad2)//self.window],times,index,big_npy,index_rate)[self.t_pad_tgt:-self.t_pad_tgt])
154 |             else:
155 |                 audio_opt.append(self.vc(model,net_g,sid,audio_pad[s:t+self.t_pad2+self.window],None,None,times,index,big_npy,index_rate)[self.t_pad_tgt:-self.t_pad_tgt])
156 |             s = t
157 |         if (if_f0 == 1):
158 |             audio_opt.append(self.vc(model,net_g,sid,audio_pad[t:],pitch[:,t//self.window:]if t is not None else pitch,pitchf[:,t//self.window:]if t is not None else pitchf,times,index,big_npy,index_rate)[self.t_pad_tgt:-self.t_pad_tgt])
159 |         else:
160 |             audio_opt.append(self.vc(model,net_g,sid,audio_pad[t:],None,None,times,index,big_npy,index_rate)[self.t_pad_tgt:-self.t_pad_tgt])
161 |         audio_opt=np.concatenate(audio_opt)
162 |         del pitch,pitchf,sid
163 |         if torch.cuda.is_available(): torch.cuda.empty_cache()
164 |         return audio_opt
165 |
--------------------------------------------------------------------------------
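vc_infer_pipeline.py is the conversion core driven by infer-web.py: a VC object is created for the target sample rate, HuBERT features are extracted from hubert_base.pt, optionally blended with a faiss retrieval index, and net_g synthesizes each slice. The f0 side can be exercised on its own without those checkpoints; a minimal sketch follows, in which the 40000 Hz target rate, the synthetic 220 Hz tone and the +2 semitone shift are arbitrary assumptions (config.py must be importable, since the module reads x_pad/x_query/x_center/x_max from it at import time).

import numpy as np
import torch

from vc_infer_pipeline import VC

device = "cuda" if torch.cuda.is_available() else "cpu"
vc = VC(tgt_sr=40000, device=device, is_half=False)

# One second of a synthetic 220 Hz tone at the 16 kHz rate the pipeline works in.
sr = 16000
t = np.arange(sr) / sr
audio = 0.5 * np.sin(2 * np.pi * 220.0 * t)

p_len = audio.shape[0] // vc.window  # one f0 frame per 160 samples
f0_coarse, f0 = vc.get_f0(audio, p_len, f0_up_key=2, f0_method="pm")
print(f0_coarse.shape, float(f0[f0 > 0].mean()))  # coarse 1-255 bins; roughly 247 Hz after the +2 semitone shift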
/weights/.gitignore:
--------------------------------------------------------------------------------
1 | *
2 | !.gitignore
3 |
--------------------------------------------------------------------------------
/使用需遵守的协议-LICENSE.txt:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2023 liujing04
4 | Copyright (c) 2023 源文雨
5 |
6 | This software and its related code are open-sourced under the MIT License. The author has no control over the software; those who use the software or distribute audio exported from it are solely responsible for doing so.
7 | If you do not accept these terms, you may not use or reference any code or files contained in this software package.
8 |
9 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
10 |
11 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
12 |
13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
14 |
15 | Any person obtaining a copy of this software and its associated documentation files (the "Software") is hereby granted, free of charge, the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
16 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
17 | The Software is provided "as is", without warranty of any kind, express or implied, including but not limited to warranties of merchantability, fitness for a particular purpose, and non-infringement. In no event shall the authors or copyright holders be liable for any claim, damages, or other liability, whether in an action of contract, tort, or otherwise, arising from, out of, or in connection with the Software or the use of or other dealings in the Software.
18 |
19 | The licenses of the referenced libraries are as follows:
20 | #################
21 | ContentVec
22 | https://github.com/auspicious3000/contentvec/blob/main/LICENSE
23 | MIT License
24 | #################
25 | VITS
26 | https://github.com/jaywalnut310/vits/blob/main/LICENSE
27 | MIT License
28 | #################
29 | HIFIGAN
30 | https://github.com/jik876/hifi-gan/blob/master/LICENSE
31 | MIT License
32 | #################
33 | gradio
34 | https://github.com/gradio-app/gradio/blob/main/LICENSE
35 | Apache License 2.0
36 | #################
37 | ffmpeg
38 | https://github.com/FFmpeg/FFmpeg/blob/master/COPYING.LGPLv3
39 | https://github.com/BtbN/FFmpeg-Builds/releases/download/autobuild-2021-02-28-12-32/ffmpeg-n4.3.2-160-gfbb9368226-win64-lgpl-4.3.zip
40 | LGPLv3 License
41 | MIT License
42 | #################
43 | ultimatevocalremovergui
44 | https://github.com/Anjok07/ultimatevocalremovergui/blob/master/LICENSE
45 | https://github.com/yang123qwe/vocal_separation_by_uvr5
46 | MIT License
47 | #################
48 | audio-slicer
49 | https://github.com/openvpi/audio-slicer/blob/main/LICENSE
50 | MIT License
51 |
--------------------------------------------------------------------------------
/小白简易教程.doc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yantaisa11/Retrieval-based-Voice-Conversion-WebUI-JP-localization/106f3ac45e1a1206489b32500f1fd66ad1744339/小白简易教程.doc
--------------------------------------------------------------------------------