├── outputs
│   └── .placeholder
├── Windows_app_start.bat
├── demo_info
│   ├── ui.png
│   ├── Rogger_sample_aa.wav
│   ├── Rogger_sample_en.wav
│   └── Rogger_sample_ru.wav
├── targets
│   └── Rogger.wav
├── requirements.txt
├── texts.json
├── languages.json
├── appTerminal.py
├── LICENSE
├── .gitignore
├── app.py
├── app2.py
└── README.md

/outputs/.placeholder:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/Windows_app_start.bat:
--------------------------------------------------------------------------------
@echo off
cd /d "%~dp0"
call venv\Scripts\activate
python app.py
--------------------------------------------------------------------------------
/demo_info/ui.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BoltzmannEntropy/xtts2-ui/HEAD/demo_info/ui.png
--------------------------------------------------------------------------------
/targets/Rogger.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BoltzmannEntropy/xtts2-ui/HEAD/targets/Rogger.wav
--------------------------------------------------------------------------------
/demo_info/Rogger_sample_aa.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BoltzmannEntropy/xtts2-ui/HEAD/demo_info/Rogger_sample_aa.wav
--------------------------------------------------------------------------------
/demo_info/Rogger_sample_en.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BoltzmannEntropy/xtts2-ui/HEAD/demo_info/Rogger_sample_en.wav
--------------------------------------------------------------------------------
/demo_info/Rogger_sample_ru.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BoltzmannEntropy/xtts2-ui/HEAD/demo_info/Rogger_sample_ru.wav
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
gradio==4.7.1
TTS==0.21.*
soundfile==0.12.1
transformers==4.33.0
streamlit
audio-recorder-streamlit
--------------------------------------------------------------------------------
/texts.json:
--------------------------------------------------------------------------------
[
  "You can add sample texts here, so that you can generate many voices and prepare a dataset for training purposes.",
  "Hello, this is a test.",
  "This is another test.",
  "This is the final test."
]
--------------------------------------------------------------------------------
/languages.json:
--------------------------------------------------------------------------------
{
  "Arabic": "ar",
  "Chinese": "zh-cn",
  "Czech": "cs",
  "Dutch": "nl",
  "English": "en",
  "French": "fr",
  "German": "de",
  "Hungarian": "hu",
  "Italian": "it",
  "Japanese": "ja",
  "Korean": "ko",
  "Polish": "pl",
  "Portuguese": "pt",
  "Russian": "ru",
  "Spanish": "es",
  "Turkish": "tr"
}
--------------------------------------------------------------------------------
/appTerminal.py:
--------------------------------------------------------------------------------
import json
from pathlib import Path

# Importing app loads the XTTS model as a side effect
from app import gen_voice, update_speakers

def generate_voices_from_file(file_path):
    # Load the texts from the JSON file
    with open(file_path, 'r') as f:
        texts = json.load(f)

    # Get the list of speakers from the targets/ folder
    speakers = update_speakers()

    # For each text, generate a voice for each speaker
    for text in texts:
        for speaker in speakers:
            gen_voice(text, speaker, speed=0.8, english="English")

if __name__ == "__main__":
    generate_voices_from_file(Path('texts.json'))
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2023 Shlomo Kashani

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
outputs/*
!outputs/.placeholder
targets/*
!targets/Rogger.wav

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
--------------------------------------------------------------------------------
/app.py:
--------------------------------------------------------------------------------
import gradio as gr
import torch
import platform
import random
import json
from pathlib import Path
from TTS.api import TTS
import uuid
import html
import soundfile as sf

def is_mac_os():
    return platform.system() == 'Darwin'

params = {
    "activate": True,
    "autoplay": True,
    "show_text": False,
    "remove_trailing_dots": False,
    "voice": "Rogger.wav",
    "language": "English",
    "model_name": "tts_models/multilingual/multi-dataset/xtts_v2",
}

# SUPPORTED_FORMATS = ['wav', 'mp3', 'flac', 'ogg']
SAMPLE_RATE = 16000

# Set the default speaker name
default_speaker_name = "Rogger"

# Use the GPU when available; fall back to the CPU on macOS or CUDA-less machines
if is_mac_os() or not torch.cuda.is_available():
    device = torch.device('cpu')
else:
    device = torch.device('cuda:0')

# Load model
tts = TTS(model_name=params["model_name"]).to(device)

# Make sure the output folder exists before generating audio
Path('outputs').mkdir(exist_ok=True)

# # Random sentence (assuming harvard_sentences.txt is in the correct path)
# def random_sentence():
#     with open(Path("harvard_sentences.txt")) as f:
#         return random.choice(list(f))

# Voice generation function
def gen_voice(string, spk, speed, english):
    string = html.unescape(string)
    # Short UUID suffix so repeated runs don't overwrite each other
    short_uuid = str(uuid.uuid4())[:8]
    output_file = Path(f"outputs/{spk}-{short_uuid}.wav")
    this_dir = str(Path(__file__).parent.resolve())
    tts.tts_to_file(
        text=string,
        speed=speed,
        file_path=output_file,
        speaker_wav=[f"{this_dir}/targets/{spk}.wav"],
        language=languages[english]
    )
    return output_file

def update_speakers():
    speakers = {p.stem: str(p) for p in Path('targets').glob("*.wav")}
    return list(speakers.keys())

def update_dropdown(_=None, selected_speaker=default_speaker_name):
    return gr.Dropdown(choices=update_speakers(), value=selected_speaker, label="Select Speaker")

def handle_recorded_audio(audio_data, filename="user_entered"):
    # Keep the current speaker list if nothing was recorded or uploaded
    if not audio_data:
        return update_dropdown()
    if not filename:
        filename = "user_entered"

    sample_rate, audio_content = audio_data
    save_path = f"targets/{filename}.wav"

    # Write the audio content to a WAV file
    sf.write(save_path, audio_content, sample_rate)

    # Return a new Dropdown with the updated speakers list, selecting the new recording
    return update_dropdown(selected_speaker=filename)


# Load the language data
with open(Path('languages.json'), encoding='utf8') as f:
    languages = json.load(f)

# Gradio Blocks interface
with gr.Blocks() as app:

    gr.Markdown("### TTS-based Voice Cloning.")

    with gr.Row():
        with gr.Column():
            text_input = gr.Textbox(lines=2, label="Speechify this Text", value="Even in the darkest nights, a single spark of hope can ignite the fire of determination within us, guiding us towards a future we dare to dream.")
            speed_slider = gr.Slider(label='Speed', minimum=0.1, maximum=1.99, value=0.8, step=0.01)
            language_dropdown = gr.Dropdown(list(languages.keys()), label="Language/Accent", value="English")

    gr.Markdown("### Speaker Selection and Voice Cloning")

    with gr.Row():
        with gr.Column():
            speaker_dropdown = update_dropdown()
            refresh_button = gr.Button("Refresh Speakers")
        with gr.Column():
            filename_input = gr.Textbox(label="Add new Speaker", placeholder="Enter a name for your recording/upload to save as")
            save_button = gr.Button("Save Below Recording")

    refresh_button.click(fn=update_dropdown, inputs=[], outputs=speaker_dropdown)

    with gr.Row():
        record_button = gr.Audio(label="Record Your Voice")

    # All three events pass (audio, filename), matching handle_recorded_audio's signature
    save_button.click(fn=handle_recorded_audio, inputs=[record_button, filename_input], outputs=speaker_dropdown)
    record_button.stop_recording(fn=handle_recorded_audio, inputs=[record_button, filename_input], outputs=speaker_dropdown)
    record_button.upload(fn=handle_recorded_audio, inputs=[record_button, filename_input], outputs=speaker_dropdown)

    submit_button = gr.Button("Convert")

    with gr.Column():
        audio_output = gr.Audio()

    submit_button.click(
        fn=gen_voice,
        inputs=[text_input, speaker_dropdown, speed_slider, language_dropdown],
        outputs=audio_output
    )

if __name__ == "__main__":
    app.launch()
--------------------------------------------------------------------------------
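A quick way to sanity-check `gen_voice` without launching the Gradio UI is to call it directly. This is a minimal sketch, not part of the repo: `smoke_test.py` is a hypothetical file name, and it assumes at least one `.wav` exists under `targets/`.

```python
# smoke_test.py (hypothetical): exercise app.py's gen_voice outside the UI.
# Importing app loads the XTTS model, so the first run takes a while.
from app import gen_voice, update_speakers

speakers = update_speakers()  # speaker names are the file stems in targets/
assert speakers, "add at least one .wav to targets/ first"

# english takes a key from languages.json (e.g. "English"), not an ISO code
out = gen_voice("Quick smoke test.", speakers[0], speed=0.8, english="English")
print(f"Wrote {out}")
```
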
/app2.py:
--------------------------------------------------------------------------------
import json
import os
import random
import uuid
import html
from pathlib import Path

import torch
import librosa
import streamlit as st
from audio_recorder_streamlit import audio_recorder
from scipy.io.wavfile import write
from TTS.api import TTS

params = {
    "activate": True,
    "autoplay": True,
    "show_text": False,
    "remove_trailing_dots": False,
    "voice": "Rogger.wav",
    "language": "English",
    "model_name": "tts_models/multilingual/multi-dataset/xtts_v2",
    # "model_path": "./models/",
    # "config_path": "./models/config.json"
}

SUPPORTED_FORMATS = ['wav']
SAMPLE_RATE = 16000

os.makedirs(os.path.join(".", "targets"), exist_ok=True)
os.makedirs(os.path.join(".", "outputs"), exist_ok=True)

speakers = {p.stem: str(p) for p in Path('targets').glob("*.wav")}

device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")

# Cache the model so Streamlit's script reruns don't reload it on every interaction
@st.cache_resource
def load_model():
    print("[XTTS] Loading XTTS...")
    tts = TTS(model_name=params["model_name"]).to(device)
    # model_path=params["model_path"],
    # config_path=params["config_path"]
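    # (Optional) Loading from a local checkout instead of the model hub is the
    # intent of the commented model_path/config_path entries above. Assumption:
    # they would be passed as TTS(model_path=..., config_path=...); verify this
    # against your installed TTS version before relying on it.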
    return tts

tts = load_model()

this_dir = str(Path(__file__).parent.resolve())

def get_available_voices():
    return sorted([voice.name for voice in Path(f"{this_dir}/targets").glob("*.wav")])

# Unused helper; requires a harvard_sentences.txt file next to this script
def random_sentence():
    with open(Path("harvard_sentences.txt")) as f:
        return random.choice(list(f))

st.title("TTS-based Voice Cloning in 16 Languages.")
# st.image('logo.png', width=150)

st.header('Text to speech generation')

with open(Path(f"{this_dir}/languages.json"), encoding='utf8') as f:
    languages = json.load(f)

with st.sidebar:
    voice_list = get_available_voices()
    print(voice_list)
    st.title("Text to Voice")
    english = st.radio(
        label="Choose your language", options=languages, index=0, horizontal=True)

    default_speaker_name = "Rogger"
    speaker_name = st.selectbox('Select target speaker:', options=[None] + list(speakers.keys()),
                                index=list(speakers.keys()).index(default_speaker_name) + 1 if default_speaker_name in speakers else 0)

    wav_tgt = None
    if speaker_name is not None:
        wav_tgt, _ = librosa.load(speakers[speaker_name], sr=22000)
        wav_tgt, _ = librosa.effects.trim(wav_tgt, top_db=20)

        st.write('Selected Target:')
        st.audio(wav_tgt, sample_rate=22000)

text = st.text_area('Enter text to convert to audio format',
                    value="Hello")
speed = st.slider('Speed', 0.1, 1.99, 0.8, 0.01)

st.caption("Optional Microphone Recording. Download and rename your recording before using.")
audio_bytes = audio_recorder()
if audio_bytes:
    st.audio(audio_bytes, format="audio/wav")

def gen_voice(string, spk):
    string = html.unescape(string)
    # Generate a short UUID so repeated runs don't overwrite each other
    short_uuid = str(uuid.uuid4())[:8]
    output_file = Path(f"outputs/{spk}-{short_uuid}.wav")
    tts.tts_to_file(
        text=string,
        speed=speed,
        file_path=output_file,
        speaker_wav=[f"{this_dir}/targets/{spk}.wav"],
        language=languages[english]
    )

    return output_file

# Upload the WAV file
st.caption("For the audio file, use the name of your Target, for instance ABIDA.wav")
new_tgt = st.file_uploader('Upload a new TARGET audio WAV file:', type=SUPPORTED_FORMATS, accept_multiple_files=False)
if new_tgt is not None:
    # Get the original file name
    file_name = new_tgt.name

    # Save the file to the file system
    file_path = os.path.join("./targets/", file_name)
    st.info(f"Original file name: {file_name}")

    # Extract the file name without the extension
    file_name_without_extension = os.path.splitext(file_name)[0]

    # Use librosa to load and process the WAV file
    wav_tgt, _ = librosa.load(new_tgt, sr=22000)
    wav_tgt, _ = librosa.effects.trim(wav_tgt, top_db=20)

    # Use scipy.io.wavfile.write to save the processed WAV file
    write('./targets/' + file_name_without_extension + '.wav', 22000, wav_tgt)

    st.success(f"New target saved successfully to {file_path}")

if st.button('Convert'):
    if speaker_name is None:
        st.warning('Please select a target speaker first.')
    else:
        # Run TTS
        st.info('Converting ... please wait ...')
        output_file = gen_voice(text, speaker_name)

        st.write(f'Target voice: {speaker_name}')

        audio_file = open(output_file, 'rb')
        audio_bytes = audio_file.read()
        st.audio(audio_bytes, format='audio/wav')
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# XTTS-2-UI: A User Interface for XTTS-2 Text-Based Voice Cloning

This repository contains the essential code for cloning any voice using just text and a 10-second audio sample of the target voice. XTTS-2-UI is simple to set up and use. [Example Results 🔊](#sample-audio-examples)

Works in [16 languages](#language-support) and has built-in voice recording/uploading.
Note: Don't expect ElevenLabs-level quality; it is not there yet.

## Model
The model used is `tts_models/multilingual/multi-dataset/xtts_v2`. For more details, refer to [Hugging Face - XTTS-v2](https://huggingface.co/coqui/XTTS-v2) and its specific version [XTTS-v2 Version 2.0.2](https://huggingface.co/coqui/XTTS-v2/tree/v2.0.2).
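Under the hood, both UIs drive the same Coqui TTS call. A minimal sketch, assuming the repo's `targets/` and `outputs/` folders exist:

```python
# Minimal XTTS-v2 cloning call, as used by app.py/app2.py
# (the first run downloads the model and asks you to accept the license).
from TTS.api import TTS

tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="Hello from a cloned voice.",
    speaker_wav=["targets/Rogger.wav"],  # ~10 s reference clip of the target voice
    language="en",                       # ISO code from languages.json
    file_path="outputs/hello.wav",
)
```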

![XTTS-2-UI interface](demo_info/ui.png)

## Table of Contents

- [XTTS-2-UI: A User Interface for XTTS-2 Text-Based Voice Cloning](#xtts-2-ui-a-user-interface-for-xtts-2-text-based-voice-cloning)
  - [Model](#model)
  - [Table of Contents](#table-of-contents)
  - [Setup](#setup)
  - [Inference](#inference)
  - [Target Voices Dataset](#target-voices-dataset)
  - [Sample Audio Examples](#sample-audio-examples)
  - [Language Support](#language-support)
  - [Notes](#notes)
  - [Credits](#credits)

## Setup

To set up this project, follow these steps in a terminal:

1. **Clone the Repository**

   - Clone the repository to your local machine.
     ```bash
     git clone https://github.com/pbanuru/xtts2-ui.git
     cd xtts2-ui
     ```

2. **Create a Virtual Environment**

   - Run the following command to create a Python virtual environment:
     ```bash
     python -m venv venv
     ```
   - Activate the virtual environment:
     - Windows:
       ```bash
       # cmd prompt
       venv\Scripts\activate
       ```
       or
       ```bash
       # git bash
       source venv/Scripts/activate
       ```
     - Linux/Mac:
       ```bash
       source venv/bin/activate
       ```

3. **Install PyTorch**

   - If you have an Nvidia CUDA-enabled GPU, choose the appropriate PyTorch installation command:
     - Before installing PyTorch, check your CUDA version by running:
       ```bash
       nvcc --version
       ```
     - For CUDA 12.1:
       ```bash
       pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
       ```
     - For CUDA 11.8:
       ```bash
       pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
       ```
   - If you don't have a CUDA-enabled GPU, follow the instructions on the [PyTorch website](https://pytorch.org/get-started/locally/) to install the appropriate version of PyTorch for your system.

4. **Install Other Required Packages**

   - Install the direct dependencies:
     ```bash
     pip install -r requirements.txt
     ```
   - Upgrade the TTS package to the latest version:
     ```bash
     pip install --upgrade TTS
     ```

After completing these steps, your setup should be complete and you can start using the project.

Models will be downloaded automatically upon first use.

Download paths:
- macOS: `/Users/USR/Library/Application Support/tts/tts_models--multilingual--multi-dataset--xtts_v2`
- Windows: `C:\Users\YOUR-USER-ACCOUNT\AppData\Local\tts\tts_models--multilingual--multi-dataset--xtts_v2`
- Linux: `/home/${USER}/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2`

## Inference

To run the application:

```
python app.py
OR
streamlit run app2.py
```

Alternatively, you can run it from the terminal itself: put sample input texts in `texts.json` and generate multiple audios with multiple speakers (you may need to adjust `appTerminal.py`):
```
python appTerminal.py
```

On initial use, you will need to agree to the terms:

```
[XTTS] Loading XTTS...
 > tts_models/multilingual/multi-dataset/xtts_v2 has been updated, clearing model cache...
 > You must agree to the terms of service to use this model.
 | > Please see the terms of service at https://coqui.ai/cpml.txt
 | > "I have read, understood and agreed to the Terms and Conditions." - [y/n]
 | | >
```
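For unattended runs (for example, batch generation with `appTerminal.py` on a fresh machine), recent Coqui TTS releases check the `COQUI_TOS_AGREED` environment variable and skip the interactive prompt. Treat this as an assumption and verify it against your installed TTS version; a sketch:

```python
# Hypothetical non-interactive run: accept the CPML terms via an environment
# variable *before* the model is downloaded/loaded on import.
import os
os.environ["COQUI_TOS_AGREED"] = "1"  # assumption: supported by your TTS version

from appTerminal import generate_voices_from_file  # triggers the model load

generate_voices_from_file("texts.json")
```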
If your model is re-downloading on each run, please consult [Issue 4723 on GitHub](https://github.com/oobabooga/text-generation-webui/issues/4723#issuecomment-1826120220).

## Target Voices Dataset
The dataset consists of a single folder named `targets`, pre-populated with several voices for testing purposes.

To add more voices (if you don't want to go through the GUI), create a 24 kHz WAV file of approximately 10 seconds and place it under the `targets` folder (see the preparation sketch at the end of this README).
You can use yt-dlp to download a voice from YouTube for cloning:
```
yt-dlp -x --audio-format wav "https://www.youtube.com/watch?"
```

## Sample Audio Examples

| Language | Audio Sample Link |
|----------|-------------------|
| English  | [▶️](demo_info/Rogger_sample_en.wav) |
| Russian  | [▶️](demo_info/Rogger_sample_ru.wav) |
| Arabic   | [▶️](demo_info/Rogger_sample_aa.wav) |

## Language Support
Arabic, Chinese, Czech, Dutch, English, French, German, Hungarian, Italian, Japanese [(see notes)](#notes), Korean, Polish, Portuguese, Russian, Spanish, Turkish

## Notes
If you would like to select **Japanese** as the target language, you must install a dictionary first.
```bash
# Lite version
pip install fugashi[unidic-lite]
```
or, for more serious processing:
```bash
# Full version
pip install fugashi[unidic]
python -m unidic download
```
More details [here](https://github.com/polm/fugashi#installing-a-dictionary).

## Credits
1. Heavily based on https://github.com/kanttouchthis/text_generation_webui_xtts/
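As referenced in [Target Voices Dataset](#target-voices-dataset), here is a minimal sketch for preparing a new target voice outside the GUI. It is a hypothetical helper, not part of the repo; it mirrors the load/trim steps `app2.py` applies to uploads, but resamples to the 24 kHz recommended above:

```python
# prepare_target.py (hypothetical): turn any speech clip into a targets/ entry.
import librosa
import soundfile as sf

SRC = "my_recording.wav"   # any ~10 s speech clip; adjust to your file
NAME = "MyVoice"           # the file stem becomes the speaker name in the UI

wav, sr = librosa.load(SRC, sr=24000)          # resample to 24 kHz
wav, _ = librosa.effects.trim(wav, top_db=20)  # strip leading/trailing silence
sf.write(f"targets/{NAME}.wav", wav, 24000)
```
--------------------------------------------------------------------------------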