├── .DS_Store
├── .gitignore
├── LICENSE
├── README.md
├── requirements.txt
└── utils.py

/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dubverse-ai/rvc-data-prep/ab8940bc668f178f75d3bdba4084d603dec5db19/.DS_Store
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
/cache
/env
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2023 Dubverse

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# RVC Data Prep: An Open-Source RVC Data Preparation Tool
### a Dubverse Black initiative

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1NA2GuJ2y-zRfoG3NearNiMCQa8NbidSh?usp=sharing)
[![Discord Shield](https://discordapp.com/api/guilds/1162007551987171410/widget.png?style=shield)](https://discord.gg/4VGnrgpBN)

------

## Description
RVC Data Prep is an advanced tool for transforming audio/video content into isolated vocals. If a video contains multiple speakers, it generates a separate file for each one. The core functionality leverages Facebook's Demucs to isolate vocals and Pyannote embeddings to identify and differentiate speakers.

## Features
1. Isolate vocals from YouTube videos
2. Distinguish multiple speakers and provide separate files
3. Trim silences greater than 300ms from the audio
4. (Beta) Separate multi-singer Acapellas

## Prerequisites
Before you start using this tool, make sure you have completed the following:
- Install Python version 3.10 or newer
- Accept [`pyannote/segmentation-3.0`](https://hf.co/pyannote/segmentation-3.0) user conditions
- Accept [`pyannote/speaker-diarization-3.0`](https://hf.co/pyannote/speaker-diarization-3.0) user conditions
- Create an access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens)

## How to use

Clone the repository

``` bash
git clone https://github.com/dubverse-ai/rvc-data-prep.git
```

Change the working directory and install the dependencies

``` bash
cd rvc-data-prep
pip install -r requirements.txt
```

The `clean` function in `utils.py` automatically processes a given file (wav, mp3, ogg and flac). Specify the parameters below depending on your needs.
Parameters:

- `local` _(bool)_: Set this to `True` if you intend to provide a local file; `False` if you intend to create a dataset from a YouTube link.

- `file_path` _(str)_: Either a local file path or a YouTube URL, depending on what you set `local` to.

- `project_name` _(str)_: The name of the project under which the processed files will be saved.

- `acapella` _(bool)_: (BETA) If this is `True`, the function inserts blank audio segments while separating and segregating speakers, so that the output files add up in the time domain to reconstruct the original file.

- `single_speaker` _(bool)_: If `True`, this flags the file as having a single speaker.

- `token` _(str)_: The access token of your Hugging Face account. You only need this when working with files involving multiple speakers; otherwise you can leave it blank.

Here is an example of how to use the `clean` function:

```python
from utils import clean

clean(local=False,
      file_path="https://www.youtube.com/watch?v=someVideoId",
      project_name="myProject",
      acapella=True,
      token="yourToken",
      single_speaker=False)
```
In this example, we provide a YouTube video URL as `file_path`, set the `project_name` to "myProject", and request an acapella output by setting `acapella` to `True`. We indicate that there may be more than one speaker by setting `single_speaker` to `False`, and pass our account token as `token`.
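
For a local recording with a single speaker, a call along these lines should also work (the file path and project name below are placeholders; no Hugging Face token is needed because diarization is skipped for single-speaker files):

```python
from utils import clean

# Process a local mp3 containing one speaker: Demucs isolates the
# vocals, then silences are trimmed from the result.
clean(local=True,
      file_path="/path/to/interview.mp3",   # placeholder local path
      project_name="interviewDataset",      # placeholder project name
      acapella=False,
      token="",                             # unused when single_speaker=True
      single_speaker=True)
```
The processed files are written under `interviewDataset/output/htdemucs/file/` in the current working directory.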

## YouTube Tutorial
[![YOUTUBE TUTORIAL](https://img.youtube.com/vi/QLQ8eSGZDi8/0.jpg)](https://www.youtube.com/watch?v=QLQ8eSGZDi8)


## Examples
| **Input Video** | **Separated Files** |
|-----------------|---------------------|
| [Shahrukh Khan's Speech](https://www.youtube.com/shorts/tsgWNmVU_B0) | [Vocals](https://dl.sndup.net/qhp6/srk-cleaned.mp3) |
| [Yeh Ladka Haaye Allah - Bollywood Song](https://www.youtube.com/watch?v=BE8_rNJOQ-0) | [Udit Narayan's Vocals](https://dl.sndup.net/rqmp/SPEAKER_00.mp3), [Alka Yagnik's Vocals](https://dl.sndup.net/rg4g/SPEAKER_02.mp3), [Chorus](https://dl.sndup.net/d8s9/SPEAKER_01.mp3), [Other ambiguous sounds](https://dl.sndup.net/wd2y/SPEAKER_03.mp3) |
| [Perfect - Ed Sheeran Duet](https://www.youtube.com/watch?v=817P8W8-mGE) | [Ed Sheeran's Vocals](https://dl.sndup.net/gmjf/perfect.mp3), [Beyonce's Vocals](https://dl.sndup.net/h4qs/perfectf.mp3) |

## Known Issues
* Struggles when multiple people are speaking at the same time
* When using `acapella = True`, some audio segments are occasionally skipped, which makes the output hard to sync manually.

## Contributing
We welcome contributions from anyone and everyone. Details about how to contribute, what we are looking for and how to get started can be found in our contributing guidelines.

## Support
For any issues, queries, and suggestions, join our [Discord server](https://discord.gg/4VGnrgpBN). We'll be glad to help!

## Future Scope
- Add multi-speaker Acapella support
- Integrate this into the RVC workflow: base data preparation and creating AI covers
- Improve the accuracy of speaker identification using other models such as Titanet

## About Us
We at **Dubverse.ai** are a dedicated and passionate group of developers who have been working for over three years on generative AI, with a specific emphasis on audio. We deeply believe in the potential of AI to revolutionize the fields of video, voiceover, podcasts and other media-related applications.

Our passion and dedication don't stop at development. We believe in sharing knowledge and nurturing a community of like-minded enthusiasts. That's why we maintain a deep tech blog where we talk about our latest research, development, trends in the field, and insights about generative AI and audio technologies.

Check out some of our RVC blog posts:

1. [Evals are all we need](https://black.dubverse.ai/p/evals-are-all-we-need)
2. [Running RVC Models on the Easy GUI](https://black.dubverse.ai/p/running-rvc-models-on-the-easy-gui)

We are always open to hearing from others who share our passion. Whether you're an expert in the field, a hobbyist, or just someone intrigued by AI and audio, feel free to reach out and connect with us.


## License
RVC Data Prep is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

*Disclaimer: This repo is not affiliated with YouTube, Facebook AI Research, or Pyannote. All trademarks referred to are the property of their respective owners.*

## Acknowledgements
1. Facebook Demucs, Pyannote Audio, Librosa, FFmpeg, and other audio-related libraries.
2. The Dubverse Black Discord and the AI Hub Discord for quick and actionable feedback.

-----------------------------------------------------------------------------

We value your feedback and encourage you to provide us with any suggestions or issues that you may encounter. Let's make this tool better together!

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
aiohttp==3.9.1
aiosignal==1.3.1
alembic==1.12.1
antlr4-python3-runtime==4.9.3
asteroid-filterbanks==0.4.0
async-timeout==4.0.3
attrs==23.1.0
audioread==3.0.1
Brotli==1.1.0
certifi==2023.11.17
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==3.0.0
colorama==0.4.6
coloredlogs==15.0.1
colorlog==6.7.0
contourpy==1.2.0
cycler==0.12.1
decorator==5.1.1
demucs==4.0.1
docopt==0.6.2
dora-search==0.1.12
einops==0.7.0
filelock==3.13.1
flatbuffers==23.5.26
fonttools==4.45.1
frozenlist==1.4.0
fsspec==2023.10.0
greenlet==3.0.1
huggingface-hub==0.19.4
humanfriendly==10.0
HyperPyYAML==1.2.2
idna==3.6
Jinja2==3.1.2
joblib==1.3.2
julius==0.2.7
kiwisolver==1.4.5
lameenc==1.6.3
lazy_loader==0.3
librosa==0.10.1
lightning==2.1.2
lightning-utilities==0.10.0
llvmlite==0.41.1
Mako==1.3.0
markdown-it-py==3.0.0
MarkupSafe==2.1.3
matplotlib==3.8.2
mdurl==0.1.2
mpmath==1.3.0
msgpack==1.0.7
multidict==6.0.4
mutagen==1.47.0
networkx==3.2.1
noisereduce==3.0.0
numba==0.58.1
numpy==1.26.2
omegaconf==2.3.0
onnxruntime==1.16.3
openunmix==1.2.1
optuna==3.4.0
packaging==23.2
pandas==2.1.3
Pillow==10.1.0
platformdirs==4.0.0
pooch==1.8.0
primePy==1.3
protobuf==4.25.1
pyannote.audio==3.1.0
pyannote.core==5.0.0
pyannote.database==5.0.1
pyannote.metrics==3.2.1
pyannote.pipeline==3.0.1
pycparser==2.21
pycryptodomex==3.19.0
pydub==0.25.1
Pygments==2.17.2
pyparsing==3.1.1
python-dateutil==2.8.2
pytorch-lightning==2.1.2
pytorch-metric-learning==2.3.0
pytz==2023.3.post1
PyYAML==6.0.1
requests==2.31.0
retrying==1.3.4
rich==13.7.0
ruamel.yaml==0.18.5
ruamel.yaml.clib==0.2.8
scikit-learn==1.3.2
scipy==1.11.4
semver==3.0.2
sentencepiece==0.1.99
shellingham==1.5.4
six==1.16.0
sortedcontainers==2.4.0
soundfile==0.12.1
soxr==0.3.7
speechbrain==0.5.16
SQLAlchemy==2.0.23
submitit==1.5.1
sympy==1.12
tabulate==0.9.0
tensorboardX==2.6.2.2
threadpoolctl==3.2.0
torch==2.1.1
torch-audiomentations==0.11.0
torch-pitch-shift==1.2.4
torchaudio==2.1.1
torchmetrics==1.2.0
torchvision==0.16.1
tqdm==4.66.1
treetable==0.2.5
typer==0.9.0
typing_extensions==4.8.0
tzdata==2023.3
urllib3==2.1.0
websockets==12.0
yarl==1.9.3
yt-dlp==2023.11.16

--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
import os
import subprocess
import io
from pathlib import Path
import select
from shutil import rmtree
import subprocess as sp
import sys
from typing import Dict, Tuple, Optional, IO
from pyannote.audio import Pipeline
import torch
import shutil

from pydub import AudioSegment
from pydub.silence import split_on_silence


def clean(local, file_path, project_name, acapella, token, single_speaker):
    """Download/copy the input, isolate vocals with Demucs and post-process them."""
    project_dir = os.getcwd() + "/" + project_name + "/"
    # Demucs is invoked with --mp3 (see separate()), so the separated vocals are
    # always written as mp3, regardless of the input format.
    ext = "mp3"

    print(f"Project Folder: {project_dir}; ext: {ext}")

    setup_project(local, file_path, project_name)
    separate(inp=project_dir + "input", outp=project_dir + "output")

    if single_speaker:
        if acapella == True:
            print(f"Isolated vocals and saved the file at {project_dir}output/htdemucs/file/")
        else:
            remove_silences(project_dir, ext)
            print(f"Separated vocals, removed silences and saved the file at {project_dir}output/htdemucs/file/")
    else:
        if acapella == False:
            diarize_dataset(token, project_dir, ext, acapella=False, silences=True)
            print(f"Separated speakers and vocals, removed silences and saved the files at {project_dir}output/htdemucs/file/")
        else:
            diarize_dataset(token, project_dir, ext, acapella=True, silences=False)
            print(f"Separated speakers and vocals and saved the files at {project_dir}output/htdemucs/file/")


def diarize_dataset(token, project_dir, ext, acapella, silences):
    """Split the separated vocals into one file per speaker using pyannote diarization."""
    if silences == True:
        remove_silences(project_dir, ext)
        file_path = f"{project_dir}output/htdemucs/file/silences_removed.{ext}"
    else:
        file_path = f"{project_dir}output/htdemucs/file/vocals.{ext}"

    pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.0", use_auth_token=token)

    # Run on GPU when one is available.
    if torch.cuda.is_available():
        pipeline.to(torch.device("cuda"))
    #diarization = pipeline(file_path, embedding_exclude_overlap=True)
    diarization = pipeline(file_path)
    audio = AudioSegment.from_file(file_path)

    speakers = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        speakers.append(speaker)

    speakers = list(set(speakers))

    buffer = {}
    for s in speakers:
        buffer[s] = AudioSegment.empty()

    for turn, _, speaker in diarization.itertracks(yield_label=True):

        start = int(turn.start * 1000)
        end = int(turn.end * 1000)
        diff = end - start

        if acapella == False:
            # Concatenate each speaker's segments back to back.
            speaker_audio = audio[start:end]
            buffer[speaker] += speaker_audio
        else:
            # Acapella mode: pad the other speakers with silence of the same length
            # so that all output files stay aligned on the original timeline.
            blank = AudioSegment.silent(diff)
            for other in speakers:
                if other == speaker:
                    speaker_audio = audio[start:end]
                    buffer[speaker] += speaker_audio
                else:
                    buffer[other] += blank

    for s in speakers:
        buffer[s].export(project_dir + "output/htdemucs/file/" + s + "." + ext)

    return speakers


def remove_silences(project_dir, ext):
    """Trim silences from the separated vocals and write silences_removed.<ext>."""
    file_path = project_dir + "output/htdemucs/file/vocals." + ext
    file_name = f"silences_removed.{ext}"
    audio_format = ext
    sound = AudioSegment.from_file(file_path, format=audio_format)
    audio_chunks = split_on_silence(sound,
                                    min_silence_len=100,
                                    silence_thresh=-45,
                                    keep_silence=50)

    combined = AudioSegment.empty()
    for chunk in audio_chunks:
        combined += chunk
    combined.export(f'{project_dir}output/htdemucs/file/{file_name}', format=audio_format)


def setup_project(local, file_path, project_name):
    """Create the project folders and place the source audio in <project>/input/."""
    pwd = os.getcwd()
    subprocess.run(f"mkdir {pwd}/{project_name}", shell=True)
    subprocess.run(f"mkdir {pwd}/{project_name}/input", shell=True)
    subprocess.run(f"mkdir {pwd}/{project_name}/output", shell=True)
    print(f"Created the project directory at {pwd}/{project_name}")

    if local:
        ext = file_path.split(".")[-1]
        shutil.copy(file_path, f"{pwd}/{project_name}/input/file.{ext}")
        print("Copied the file from the given path to the project directory")
    else:
        out_dir = f"{pwd}/{project_name}/input/file.mp3"
        subprocess.run(f"yt-dlp -x --audio-format mp3 -o {out_dir} {file_path}", shell=True)
        print("Downloaded the YouTube video and saved it in the project directory")


def find_files(in_path):
    """Return all supported audio files directly inside in_path."""
    out = []
    for file in Path(in_path).iterdir():
        if file.suffix.lower().lstrip(".") in ["mp3", "wav", "ogg", "flac"]:
            out.append(file)
    return out


def copy_process_streams(process: sp.Popen):
    """Stream a child process's stdout/stderr to this process's stdout/stderr."""
    def raw(stream: Optional[IO[bytes]]) -> IO[bytes]:
        assert stream is not None
        if isinstance(stream, io.BufferedIOBase):
            stream = stream.raw
        return stream

    p_stdout, p_stderr = raw(process.stdout), raw(process.stderr)
    stream_by_fd: Dict[int, Tuple[IO[bytes], IO[str]]] = {
        p_stdout.fileno(): (p_stdout, sys.stdout),
        p_stderr.fileno(): (p_stderr, sys.stderr),
    }
    fds = list(stream_by_fd.keys())

    while fds:
        ready, _, _ = select.select(fds, [], [])
        for fd in ready:
            p_stream, std = stream_by_fd[fd]
            raw_buf = p_stream.read(2 ** 16)
            if not raw_buf:
                fds.remove(fd)
                continue
            buf = raw_buf.decode()
            std.write(buf)
            std.flush()


def separate(inp, outp):
    """Run Demucs (htdemucs, two stems) on every audio file in inp, writing mp3s to outp."""
    cmd = ["python3", "-m", "demucs.separate", "-o", str(outp), "-n", "htdemucs"]
    cmd += ["--mp3", "--mp3-bitrate=320"]
    cmd += ["--two-stems=vocals"]

    files = [str(f) for f in find_files(inp)]
    if not files:
        print(f"No valid audio files in {inp}")
        return
    p = sp.Popen(cmd + files, stdout=sp.PIPE, stderr=sp.PIPE)
    copy_process_streams(p)
    p.wait()
    if p.returncode != 0:
        print("Command failed, something went wrong.")
--------------------------------------------------------------------------------
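
If the vocals for a project have already been separated, the helpers in `utils.py` can also be driven directly instead of going through `clean`. A minimal sketch (the project directory and token below are placeholders, and the folder layout must match what `separate` produces):

```python
from utils import diarize_dataset

# Assumes <project_dir>/output/htdemucs/file/vocals.mp3 already exists.
project_dir = "./myProject/"   # placeholder; keep the trailing slash, as clean() does
speakers = diarize_dataset(
    token="hf_xxx",            # placeholder Hugging Face access token
    project_dir=project_dir,
    ext="mp3",
    acapella=False,
    silences=True,             # trim silences before diarization
)
print(speakers)                # e.g. ['SPEAKER_00', 'SPEAKER_01']
```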