├── .gitignore ├── LICENSE ├── README.md ├── README_OpenVoice.md ├── build_docker.py ├── default.wav ├── demo ├── Female-Young-British.mp3 ├── Male-MiddleAged_American.mp3 ├── default.wav └── demo.py ├── demo_part1.ipynb ├── demo_part2.ipynb ├── demo_part3.ipynb ├── docs ├── QA.md └── USAGE.md ├── openvoice ├── __init__.py ├── api.py ├── attentions.py ├── commons.py ├── mel_processing.py ├── models.py ├── modules.py ├── openvoice_app.py ├── openvoice_server.py ├── se_extractor.py ├── text │ ├── __init__.py │ ├── cleaners.py │ ├── english.py │ ├── mandarin.py │ └── symbols.py ├── transforms.py └── utils.py ├── openvoice_server ├── Dockerfile ├── start.sh └── start_and_stop_server.sh ├── requirements.txt ├── resources ├── demo_speaker0.mp3 ├── demo_speaker1.mp3 ├── demo_speaker2.mp3 ├── example_reference.mp3 ├── framework-ipa.png ├── huggingface.png ├── lepton-hd.png ├── myshell-hd.png ├── openvoicelogo.jpg ├── tts-guide.png └── voice-clone-guide.png └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__/ 2 | .ipynb_checkpoints/ 3 | processed 4 | outputs 5 | outputs_v2 6 | checkpoints 7 | checkpoints_v2 8 | trash 9 | examples* 10 | .env 11 | build 12 | *.egg-info/ 13 | *.zip 14 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright 2024 MyShell.ai 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 4 | 5 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 6 | 7 | THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # OpenVoice Server 2 | 3 | OpenVoice Server is a FastAPI application that provides endpoints for uploading audio files, performing text-to-speech conversion, and synthesizing speech from text using a specified voice and style. 4 | It is built on top of the OpenVoice project, which is a versatile instant voice cloning system that can accurately clone the reference tone color and generate speech in multiple languages and accents. 5 | See: https://github.com/myshell-ai/OpenVoice 6 | 7 | ## Runpod.io template 8 | I have provided 2 easy to deploy one-click templates for Runpod.io, for those who want to quickly deploy the OpenVoice Server on a Runpod instance. 9 | The first template uses OpenVoice V1, and the second template uses OpenVoice V2, there are slight changes in the API endpoints (v1 has style and language, v2 only has accent as parameters). 
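As a rough illustration of that difference, the sketch below contrasts the query parameters the two templates expect. This is only a hedged sketch: the V2 parameters match the endpoint documentation further down in this README, while the V1 names (`style`, `language`) are assumptions taken from the note above and may differ slightly on the deployed V1 template.

```python
import requests

# Hedged sketch: V2 parameters follow the endpoint docs below; the V1 names
# ("style", "language") are assumptions based on the note above.
v2_params = {"text": "Hello, world!", "voice": "example_reference", "accent": "en-newest", "speed": 1.0}
v1_params = {"text": "Hello, world!", "voice": "example_reference", "style": "default", "language": "English", "speed": 1.0}

# Point this at whichever template you deployed; against a V1 deployment you
# would pass v1_params instead of v2_params.
response = requests.get("http://localhost:8000/synthesize_speech/", params=v2_params)
with open("output.wav", "wb") as f:
    f.write(response.content)
```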
10 | 11 | V1 is slightly faster but only supports English, while V2 sounds better and supports multiple languages and accents. 12 | 13 | OpenVoice V1 server: 14 | https://runpod.io/console/deploy?template=6uvya46w7n&ref=2vdt3dn9 15 | 16 | OpenVoice V2 server: 17 | https://runpod.io/console/deploy?template=qo8v6w92q2&ref=2vdt3dn9 18 | 19 | ## Installation 20 | 21 | 1. Clone the repository: 22 | ```bash 23 | git clone https://github.com/ValyrianTech/OpenVoice_server.git 24 | ``` 25 | 2. Navigate to the project directory: 26 | ```bash 27 | cd openvoice_server 28 | ``` 29 | 3. Install the required dependencies: 30 | ```bash 31 | pip install -r requirements.txt 32 | ``` 33 | 34 | ## Usage 35 | 36 | To start the server, run the following command: 37 | 38 | ```bash 39 | cd openvoice 40 | uvicorn openvoice_server:app --host 0.0.0.0 --port 8000 41 | ``` 42 | 43 | The server provides the following endpoints: 44 | 45 | ### 1. Base Text-to-Speech 46 | 47 | This endpoint performs text-to-speech conversion using only the base speaker. 48 | 49 | **Endpoint:** `/base_tts/` 50 | 51 | **Method:** `GET` 52 | 53 | **Request Body:** 54 | 55 | - `text` (str): The text to be converted to speech. 56 | - `accent` (str, optional): The accent to be used for the synthesized speech. Defaults to 'en-newest'. Options are: 'en-au', 'en-br', 'en-default', 'en-india', 'en-newest', 'en-us', 'es', 'fr', 'jp', 'kr', 'zh' 57 | - `speed` (float, optional): The speed of the synthesized speech. Defaults to 1.0. 58 | 59 | **Example Request:** 60 | 61 | ```python 62 | import requests 63 | 64 | url = "http://localhost:8000/base_tts/" 65 | params = { 66 | "text": "Hello, world!", 67 | "accent": "en-newest", 68 | "language": "English", 69 | "speed": 1.0 70 | } 71 | 72 | response = requests.get(url, params=params) 73 | 74 | with open("output.wav", "wb") as f: 75 | f.write(response.content) 76 | ``` 77 | 78 | ### 2. Change Voice 79 | 80 | This endpoint allows you to change the voice of an existing audio file. 81 | 82 | **Endpoint:** `/change_voice/` 83 | 84 | **Method:** `POST` 85 | 86 | **Request Body:** 87 | 88 | - `reference_speaker` (str): The name of the reference speaker. 89 | - `file` (file): The audio file to be changed. 90 | - `watermark` (str, optional): The watermark to be encoded in the voice conversion. Defaults to '@MyShell'. 91 | 92 | **Example Request:** 93 | 94 | ```python 95 | import requests 96 | 97 | url = "http://localhost:8000/change_voice/" 98 | file = open("example.wav", "rb") 99 | data = {"reference_speaker": "example_label"} 100 | 101 | response = requests.post(url, data=data, files={"file": file}) 102 | 103 | with open("output.wav", "wb") as f: 104 | f.write(response.content) 105 | ``` 106 | 107 | ### 3. Upload Audio 108 | 109 | This endpoint allows you to upload an audio file that will be used as the reference audio for speech synthesis. 110 | 111 | **Endpoint:** `/upload_audio/` 112 | 113 | **Method:** `POST` 114 | 115 | **Request Body:** 116 | 117 | - `audio_file_label` (str): The label for the audio file. 118 | - `file` (file): The audio file to be uploaded. 119 | 120 | **Example Request:** 121 | 122 | ```python 123 | import requests 124 | 125 | url = "http://localhost:8000/upload_audio/" 126 | audio_file_label = "example_label" 127 | file = open("example.wav", "rb") 128 | 129 | response = requests.post(url, data={"audio_file_label": audio_file_label}, files={"file": file}) 130 | 131 | print(response.json()) 132 | ``` 133 | 134 | ### 4. 
Synthesize Speech 135 | 136 | This endpoint synthesizes speech from text using a specified voice and style. 137 | 138 | **Endpoint:** `/synthesize_speech/` 139 | 140 | **Method:** `GET` 141 | 142 | **Request Body:** 143 | 144 | - `text` (str): The text to be synthesized into speech. 145 | - `voice` (str): The voice to be used for the synthesized speech. 146 | - `accent` (str, optional): The accent to be used for the synthesized speech. Defaults to 'en-newest'. Options are: 'en-au', 'en-br', 'en-default', 'en-india', 'en-newest', 'en-us', 'es', 'fr', 'jp', 'kr', 'zh' 147 | - `speed` (float, optional): The speed of the synthesized speech. Defaults to 1.0. 148 | - `watermark` (str, optional): The watermark to be encoded in the voice conversion. Defaults to '@MyShell'. 149 | 150 | Note: OpenVoice comes with a few voices as examples, the following are available: 151 | - example_reference 152 | - demo_speaker0 153 | - demo_speaker1 154 | - demo_speaker2 155 | 156 | **Example Request:** 157 | 158 | ```python 159 | import requests 160 | 161 | url = "http://localhost:8000/synthesize_speech/" 162 | data = { 163 | "text": "Hello, world!", 164 | "voice": "example_label", 165 | "accent": "en-newest", 166 | "speed": 1.0 167 | } 168 | 169 | response = requests.get(url, params=data) 170 | 171 | with open("output.wav", "wb") as f: 172 | f.write(response.content) 173 | ``` 174 | 175 | The response will be the synthesized speech audio file. In the headers of the response are 2 additional fields: 176 | - x-elapsed-time: The time taken to synthesize the speech in seconds. 177 | - x-device-used: The device used for synthesis. 178 | -------------------------------------------------------------------------------- /README_OpenVoice.md: -------------------------------------------------------------------------------- 1 |
2 |
 
3 | 4 | 5 | [Paper](https://arxiv.org/abs/2312.01479) | 6 | [Website](https://research.myshell.ai/open-voice) 7 | 8 |
9 | 10 | ## Introduction 11 | As we detailed in our [paper](https://arxiv.org/abs/2312.01479) and [website](https://research.myshell.ai/open-voice), the advantages of OpenVoice are three-fold: 12 | 13 | **1. Accurate Tone Color Cloning.** 14 | OpenVoice can accurately clone the reference tone color and generate speech in multiple languages and accents. 15 | 16 | **2. Flexible Voice Style Control.** 17 | OpenVoice enables granular control over voice styles, such as emotion and accent, as well as other style parameters including rhythm, pauses, and intonation. 18 | 19 | **3. Zero-shot Cross-lingual Voice Cloning.** 20 | Neither the language of the generated speech nor the language of the reference speech needs to be present in the massive-speaker multi-lingual training dataset. 21 | 22 | [Video](https://github.com/myshell-ai/OpenVoice/assets/40556743/3cba936f-82bf-476c-9e52-09f0f417bb2f) 23 | 24 | OpenVoice has been powering the instant voice cloning capability of [myshell.ai](https://app.myshell.ai/explore) since May 2023. As of Nov 2023, the voice cloning model had been used tens of millions of times by users worldwide and had seen explosive user growth on the platform. 25 | 26 | ## Main Contributors 27 | 28 | - [Zengyi Qin](https://www.qinzy.tech) at MIT and MyShell 29 | - [Wenliang Zhao](https://wl-zhao.github.io) at Tsinghua University 30 | - [Xumin Yu](https://yuxumin.github.io) at Tsinghua University 31 | - [Ethan Sun](https://twitter.com/ethan_myshell) at MyShell 32 | 33 | ## How to Use 34 | Please see [usage](docs/USAGE.md) for detailed instructions. 35 | 36 | ## Common Issues 37 | 38 | Please see [QA](docs/QA.md) for common questions and answers. We will regularly update the question and answer list. 39 | 40 | ## Join Our Community 41 | 42 | Join our [Discord community](https://discord.gg/myshell) and select the `Developer` role upon joining to gain exclusive access to our developer-only channel! Don't miss out on valuable discussions and collaboration opportunities. 43 | 44 | ## Citation 45 | ``` 46 | @article{qin2023openvoice, 47 | title={OpenVoice: Versatile Instant Voice Cloning}, 48 | author={Qin, Zengyi and Zhao, Wenliang and Yu, Xumin and Sun, Xin}, 49 | journal={arXiv preprint arXiv:2312.01479}, 50 | year={2023} 51 | } 52 | ``` 53 | 54 | ## License 55 | This repository is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, which prohibits commercial usage. **This will be changed to a license that allows Free Commercial usage in the near future.** Stay tuned. For social responsibility and anti-misuse considerations, **MyShell reserves the ability to detect whether an audio is generated by OpenVoice**, no matter whether the watermark is added or not. 56 | 57 | ## Acknowledgements 58 | This implementation is based on several excellent projects: [TTS](https://github.com/coqui-ai/TTS), [VITS](https://github.com/jaywalnut310/vits), and [VITS2](https://github.com/daniilrobnikov/vits2). Thanks for their awesome work!
59 | -------------------------------------------------------------------------------- /build_docker.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import datetime 3 | import os 4 | import subprocess 5 | import logging 6 | import argparse 7 | 8 | today_tag = datetime.datetime.now().strftime("%d%m%Y") 9 | 10 | # Creating argparse parser 11 | parser = argparse.ArgumentParser(description="Build Dockerfile") 12 | parser.add_argument('docker', type=str, help='Name of the Dockerfile to build - should match a folder name in this repo') 13 | parser.add_argument('--username', type=str, default="valyriantech", help=f"Tag to use. Defaults to today's date: valyriantech") 14 | parser.add_argument('--tag', type=str, default=today_tag, help=f"Tag to use. Defaults to today's date: {today_tag}") 15 | parser.add_argument('--latest', action="store_true", help='If specified, we will also tag and push :latest') 16 | args = parser.parse_args() 17 | 18 | logger = logging.getLogger() 19 | logging.basicConfig( 20 | format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S" 21 | ) 22 | 23 | dockerLLM_dir = os.path.dirname(os.path.realpath(__file__)) 24 | username = args.username 25 | 26 | def docker_command(command): 27 | try: 28 | logger.info(f"Running docker command: {command}") 29 | subprocess.check_call(command, shell=True) 30 | except subprocess.CalledProcessError as e: 31 | logger.error(f"Got error while executing docker command: {e}") 32 | raise 33 | except Exception as e: 34 | raise e 35 | 36 | def build(docker_repo, tag, from_docker=None): 37 | docker_container = f"{username}/{docker_repo}:{tag}" 38 | logger.info(f"Building and pushing {docker_container}") 39 | 40 | docker_build_arg = f"--progress=plain -t {docker_container}" 41 | if from_docker is not None: 42 | docker_build_arg += f" --build-arg DOCKER_FROM={from_docker}" 43 | 44 | build_command = f"docker build {docker_build_arg} {dockerLLM_dir}/{docker_repo}" 45 | push_command = f"docker push {docker_container}" 46 | 47 | docker_command(build_command) 48 | docker_command(push_command) 49 | 50 | return docker_container 51 | 52 | def tag(source_container, target_container): 53 | tag_command = f"docker tag {source_container} {target_container}" 54 | docker_command(tag_command) 55 | docker_command(f"docker push {target_container}") 56 | 57 | 58 | try: 59 | container = build(args.docker, args.tag) 60 | logger.info(f"Successfully built and pushed the container to {container}") 61 | 62 | if args.latest: 63 | latest = f"{username}/{args.docker}:latest" 64 | tag(container, latest) 65 | logger.info(f"Successfully tagged and pushed to {latest}") 66 | 67 | except subprocess.CalledProcessError as e: 68 | logger.error(f"Process aborted due to error running Docker commands") 69 | except Exception as e: 70 | raise e 71 | 72 | -------------------------------------------------------------------------------- /default.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ValyrianTech/OpenVoice_server/310f658a0db23852fa16dabc24401f86a4d9f199/default.wav -------------------------------------------------------------------------------- /demo/Female-Young-British.mp3: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ValyrianTech/OpenVoice_server/310f658a0db23852fa16dabc24401f86a4d9f199/demo/Female-Young-British.mp3 
-------------------------------------------------------------------------------- /demo/Male-MiddleAged_American.mp3: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ValyrianTech/OpenVoice_server/310f658a0db23852fa16dabc24401f86a4d9f199/demo/Male-MiddleAged_American.mp3 -------------------------------------------------------------------------------- /demo/default.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ValyrianTech/OpenVoice_server/310f658a0db23852fa16dabc24401f86a4d9f199/demo/default.wav -------------------------------------------------------------------------------- /demo/demo.py: -------------------------------------------------------------------------------- 1 | import time 2 | 3 | import requests 4 | import pygame 5 | 6 | # Change this to the host of the OpenVoiceV2_server 7 | host = 'https://xxxxxxxxxxxx-8000.proxy.runpod.net' 8 | 9 | accents = { 10 | 'English': 'en-newest', 11 | 'Australian': 'en-au', 12 | 'British': 'en-br', 13 | 'American': 'en-us', 14 | 'Indian': 'en-india', 15 | # 'Default': 'en-default', 16 | # 'Spanish': 'es', 17 | # 'French': 'fr', 18 | # 'Japanese': 'jp', 19 | # 'Korean': 'kr', 20 | # 'Chinese': 'zh' 21 | } 22 | 23 | voices = { 24 | 'default Voice': 'default.wav', 25 | 'Male Voice': 'Male-MiddleAged_American.mp3', 26 | 'Female Voice': 'Female-Young-British.mp3', 27 | } 28 | 29 | sentences = [ 30 | # "My name is Grace Penelope Targaryen, fourth of my name. I am a super-intelligent AI developed by Valyrian tech.", 31 | "Efficient algorithms and robust data structures are essential foundations for building reliable software systems." 32 | ] 33 | 34 | 35 | def upload_audio(file_path: str, audio_file_label: str): 36 | url = host + "/upload_audio/" 37 | file = open(file_path, "rb") 38 | 39 | response = requests.post(url, data={"audio_file_label": audio_file_label}, files={"file": file}) 40 | print(response.status_code) 41 | 42 | return response.json() 43 | 44 | 45 | def synthesize_speech(text: str, voice: str, accent="en-us", speed=1.0): 46 | url = host + "/synthesize_speech/" 47 | data = { 48 | "text": text, 49 | "voice": voice, 50 | "accent": accent, 51 | "speed": speed 52 | } 53 | 54 | response = requests.get(url, params=data) 55 | print(f'{response.status_code}\n{round(float(response.headers["X-Elapsed-Time"]), ndigits=2)} seconds to generate.') 56 | return response.content 57 | 58 | 59 | def play_wav(file_path): 60 | # Initialize pygame 61 | pygame.init() 62 | 63 | # Load the WAV file 64 | sound = pygame.mixer.Sound(file_path) 65 | 66 | # Play the sound 67 | sound.play() 68 | 69 | # Wait for the sound to finish playing 70 | while pygame.mixer.get_busy(): 71 | continue 72 | 73 | # Clean up 74 | pygame.quit() 75 | 76 | 77 | def base_tts(text, accent: str = 'en-newest', speed=1.0): 78 | url = host + "/base_tts/" 79 | params = { 80 | "text": text, 81 | "accent": accent, 82 | "speed": speed, 83 | 84 | } 85 | 86 | response = requests.get(url, params=params) 87 | print(f'{response.status_code}') 88 | 89 | return response.content 90 | 91 | 92 | def change_voice(file_path, reference_speaker): 93 | url = host + "/change_voice/" 94 | file = open(file_path, "rb") 95 | data = {"reference_speaker": reference_speaker} 96 | 97 | response = requests.post(url, files={"file": file}, data=data) 98 | print(response.status_code) 99 | 100 | return response.content 101 | 102 | 103 | if __name__ == '__main__': 104 | print('Running 
OpenVoiceV2_server demo') 105 | 106 | output = base_tts(text="This is the demo for the OpenVoice_server, this is the voice of the base speaker", accent='en-newest', speed=1.0) 107 | with open("output.wav", "wb") as f: 108 | f.write(output) 109 | # Example usage: 110 | output_file = "output.wav" 111 | play_wav(output_file) 112 | 113 | for voice in voices: 114 | print(f'\n\nRunning demo for voice: {voice}') 115 | print(f'Uploading {voices[voice]}') 116 | upload_audio(file_path=voices[voice], audio_file_label=voice) 117 | 118 | time.sleep(1) 119 | 120 | print('Testing synthesize_speech') 121 | for accent_name, accent in accents.items(): 122 | for sentence in sentences: 123 | full_sentence = f'As {voice} with a {accent_name} accent: {sentence}' 124 | print(f'\nGenerating audio for: {full_sentence}') 125 | 126 | output = synthesize_speech(text=full_sentence, voice=voice, accent=accent) 127 | with open("output.wav", "wb") as f: 128 | f.write(output) 129 | # Example usage: 130 | output_file = "output.wav" 131 | play_wav(output_file) 132 | 133 | print('-' * 50) 134 | time.sleep(1) 135 | print('Demo complete') 136 | -------------------------------------------------------------------------------- /demo_part1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "b6ee1ede", 6 | "metadata": {}, 7 | "source": [ 8 | "## Voice Style Control Demo" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": null, 14 | "id": "b7f043ee", 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import os\n", 19 | "import torch\n", 20 | "from openvoice import se_extractor\n", 21 | "from openvoice.api import BaseSpeakerTTS, ToneColorConverter" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "id": "15116b59", 27 | "metadata": {}, 28 | "source": [ 29 | "### Initialization" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "id": "aacad912", 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "ckpt_base = 'checkpoints/base_speakers/EN'\n", 40 | "ckpt_converter = 'checkpoints/converter'\n", 41 | "device=\"cuda:0\" if torch.cuda.is_available() else \"cpu\"\n", 42 | "output_dir = 'outputs'\n", 43 | "\n", 44 | "base_speaker_tts = BaseSpeakerTTS(f'{ckpt_base}/config.json', device=device)\n", 45 | "base_speaker_tts.load_ckpt(f'{ckpt_base}/checkpoint.pth')\n", 46 | "\n", 47 | "tone_color_converter = ToneColorConverter(f'{ckpt_converter}/config.json', device=device)\n", 48 | "tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')\n", 49 | "\n", 50 | "os.makedirs(output_dir, exist_ok=True)" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "id": "7f67740c", 56 | "metadata": {}, 57 | "source": [ 58 | "### Obtain Tone Color Embedding" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "id": "f8add279", 64 | "metadata": {}, 65 | "source": [ 66 | "The `source_se` is the tone color embedding of the base speaker. \n", 67 | "It is an average of multiple sentences generated by the base speaker. We directly provide the result here but\n", 68 | "the readers feel free to extract `source_se` by themselves." 
69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "id": "63ff6273", 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "source_se = torch.load(f'{ckpt_base}/en_default_se.pth').to(device)" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "id": "4f71fcc3", 84 | "metadata": {}, 85 | "source": [ 86 | "The `reference_speaker.mp3` below points to the short audio clip of the reference whose voice we want to clone. We provide an example here. If you use your own reference speakers, please **make sure each speaker has a unique filename.** The `se_extractor` will save the `targeted_se` using the filename of the audio and **will not automatically overwrite.**" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": null, 92 | "id": "55105eae", 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "reference_speaker = 'resources/example_reference.mp3' # This is the voice you want to clone\n", 97 | "target_se, audio_name = se_extractor.get_se(reference_speaker, tone_color_converter, target_dir='processed', vad=True)" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "id": "a40284aa", 103 | "metadata": {}, 104 | "source": [ 105 | "### Inference" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": null, 111 | "id": "73dc1259", 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "save_path = f'{output_dir}/output_en_default.wav'\n", 116 | "\n", 117 | "# Run the base speaker tts\n", 118 | "text = \"This audio is generated by OpenVoice.\"\n", 119 | "src_path = f'{output_dir}/tmp.wav'\n", 120 | "base_speaker_tts.tts(text, src_path, speaker='default', language='English', speed=1.0)\n", 121 | "\n", 122 | "# Run the tone color converter\n", 123 | "encode_message = \"@MyShell\"\n", 124 | "tone_color_converter.convert(\n", 125 | " audio_src_path=src_path, \n", 126 | " src_se=source_se, \n", 127 | " tgt_se=target_se, \n", 128 | " output_path=save_path,\n", 129 | " message=encode_message)" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "id": "6e3ea28a", 135 | "metadata": {}, 136 | "source": [ 137 | "**Try with different styles and speed.** The style can be controlled by the `speaker` parameter in the `base_speaker_tts.tts` method. Available choices: friendly, cheerful, excited, sad, angry, terrified, shouting, whispering. Note that the tone color embedding need to be updated. The speed can be controlled by the `speed` parameter. Let's try whispering with speed 0.9." 
138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "id": "fd022d38", 144 | "metadata": {}, 145 | "outputs": [], 146 | "source": [ 147 | "source_se = torch.load(f'{ckpt_base}/en_style_se.pth').to(device)\n", 148 | "save_path = f'{output_dir}/output_whispering.wav'\n", 149 | "\n", 150 | "# Run the base speaker tts\n", 151 | "text = \"This audio is generated by OpenVoice.\"\n", 152 | "src_path = f'{output_dir}/tmp.wav'\n", 153 | "base_speaker_tts.tts(text, src_path, speaker='whispering', language='English', speed=0.9)\n", 154 | "\n", 155 | "# Run the tone color converter\n", 156 | "encode_message = \"@MyShell\"\n", 157 | "tone_color_converter.convert(\n", 158 | " audio_src_path=src_path, \n", 159 | " src_se=source_se, \n", 160 | " tgt_se=target_se, \n", 161 | " output_path=save_path,\n", 162 | " message=encode_message)" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "id": "5fcfc70b", 168 | "metadata": {}, 169 | "source": [ 170 | "**Try with different languages.** OpenVoice can achieve multi-lingual voice cloning by simply replace the base speaker. We provide an example with a Chinese base speaker here and we encourage the readers to try `demo_part2.ipynb` for a detailed demo." 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "id": "a71d1387", 177 | "metadata": {}, 178 | "outputs": [], 179 | "source": [ 180 | "\n", 181 | "ckpt_base = 'checkpoints/base_speakers/ZH'\n", 182 | "base_speaker_tts = BaseSpeakerTTS(f'{ckpt_base}/config.json', device=device)\n", 183 | "base_speaker_tts.load_ckpt(f'{ckpt_base}/checkpoint.pth')\n", 184 | "\n", 185 | "source_se = torch.load(f'{ckpt_base}/zh_default_se.pth').to(device)\n", 186 | "save_path = f'{output_dir}/output_chinese.wav'\n", 187 | "\n", 188 | "# Run the base speaker tts\n", 189 | "text = \"今天天气真好,我们一起出去吃饭吧。\"\n", 190 | "src_path = f'{output_dir}/tmp.wav'\n", 191 | "base_speaker_tts.tts(text, src_path, speaker='default', language='Chinese', speed=1.0)\n", 192 | "\n", 193 | "# Run the tone color converter\n", 194 | "encode_message = \"@MyShell\"\n", 195 | "tone_color_converter.convert(\n", 196 | " audio_src_path=src_path, \n", 197 | " src_se=source_se, \n", 198 | " tgt_se=target_se, \n", 199 | " output_path=save_path,\n", 200 | " message=encode_message)" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "id": "8e513094", 206 | "metadata": {}, 207 | "source": [ 208 | "**Tech for good.** For people who will deploy OpenVoice for public usage: We offer you the option to add watermark to avoid potential misuse. Please see the ToneColorConverter class. **MyShell reserves the ability to detect whether an audio is generated by OpenVoice**, no matter whether the watermark is added or not." 
209 | ] 210 | } 211 | ], 212 | "metadata": { 213 | "interpreter": { 214 | "hash": "9d70c38e1c0b038dbdffdaa4f8bfa1f6767c43760905c87a9fbe7800d18c6c35" 215 | }, 216 | "kernelspec": { 217 | "display_name": "Python 3 (ipykernel)", 218 | "language": "python", 219 | "name": "python3" 220 | }, 221 | "language_info": { 222 | "codemirror_mode": { 223 | "name": "ipython", 224 | "version": 3 225 | }, 226 | "file_extension": ".py", 227 | "mimetype": "text/x-python", 228 | "name": "python", 229 | "nbconvert_exporter": "python", 230 | "pygments_lexer": "ipython3", 231 | "version": "3.9.18" 232 | } 233 | }, 234 | "nbformat": 4, 235 | "nbformat_minor": 5 236 | } 237 | -------------------------------------------------------------------------------- /demo_part2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "b6ee1ede", 6 | "metadata": {}, 7 | "source": [ 8 | "## Cross-Lingual Voice Clone Demo" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": null, 14 | "id": "b7f043ee", 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import os\n", 19 | "import torch\n", 20 | "from openvoice import se_extractor\n", 21 | "from openvoice.api import ToneColorConverter" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "id": "15116b59", 27 | "metadata": {}, 28 | "source": [ 29 | "### Initialization" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "id": "aacad912", 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "ckpt_converter = 'checkpoints/converter'\n", 40 | "device=\"cuda:0\" if torch.cuda.is_available() else \"cpu\"\n", 41 | "output_dir = 'outputs'\n", 42 | "\n", 43 | "tone_color_converter = ToneColorConverter(f'{ckpt_converter}/config.json', device=device)\n", 44 | "tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')\n", 45 | "\n", 46 | "os.makedirs(output_dir, exist_ok=True)" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "id": "3db80fcf", 52 | "metadata": {}, 53 | "source": [ 54 | "In this demo, we will use OpenAI TTS as the base speaker to produce multi-lingual speech audio. The users can flexibly change the base speaker according to their own needs. Please create a file named `.env` and place OpenAI key as `OPENAI_API_KEY=xxx`. We have also provided a Chinese base speaker model (see `demo_part1.ipynb`)." 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "id": "3b245ca3", 61 | "metadata": {}, 62 | "outputs": [], 63 | "source": [ 64 | "from openai import OpenAI\n", 65 | "from dotenv import load_dotenv\n", 66 | "\n", 67 | "# Please create a file named .env and place your\n", 68 | "# OpenAI key as OPENAI_API_KEY=xxx\n", 69 | "load_dotenv() \n", 70 | "\n", 71 | "client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\"))\n", 72 | "\n", 73 | "response = client.audio.speech.create(\n", 74 | " model=\"tts-1\",\n", 75 | " voice=\"nova\",\n", 76 | " input=\"This audio will be used to extract the base speaker tone color embedding. 
\" + \\\n", 77 | " \"Typically a very short audio should be sufficient, but increasing the audio \" + \\\n", 78 | " \"length will also improve the output audio quality.\"\n", 79 | ")\n", 80 | "\n", 81 | "response.stream_to_file(f\"{output_dir}/openai_source_output.mp3\")" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "id": "7f67740c", 87 | "metadata": {}, 88 | "source": [ 89 | "### Obtain Tone Color Embedding" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "id": "f8add279", 95 | "metadata": {}, 96 | "source": [ 97 | "The `source_se` is the tone color embedding of the base speaker. \n", 98 | "It is an average for multiple sentences with multiple emotions\n", 99 | "of the base speaker. We directly provide the result here but\n", 100 | "the readers feel free to extract `source_se` by themselves." 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": null, 106 | "id": "63ff6273", 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "base_speaker = f\"{output_dir}/openai_source_output.mp3\"\n", 111 | "source_se, audio_name = se_extractor.get_se(base_speaker, tone_color_converter, vad=True)\n", 112 | "\n", 113 | "reference_speaker = 'resources/example_reference.mp3' # This is the voice you want to clone\n", 114 | "target_se, audio_name = se_extractor.get_se(reference_speaker, tone_color_converter, vad=True)" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "id": "a40284aa", 120 | "metadata": {}, 121 | "source": [ 122 | "### Inference" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": null, 128 | "id": "73dc1259", 129 | "metadata": {}, 130 | "outputs": [], 131 | "source": [ 132 | "# Run the base speaker tts\n", 133 | "text = [\n", 134 | " \"MyShell is a decentralized and comprehensive platform for discovering, creating, and staking AI-native apps.\",\n", 135 | " \"MyShell es una plataforma descentralizada y completa para descubrir, crear y apostar por aplicaciones nativas de IA.\",\n", 136 | " \"MyShell est une plateforme décentralisée et complète pour découvrir, créer et miser sur des applications natives d'IA.\",\n", 137 | " \"MyShell ist eine dezentralisierte und umfassende Plattform zum Entdecken, Erstellen und Staken von KI-nativen Apps.\",\n", 138 | " \"MyShell è una piattaforma decentralizzata e completa per scoprire, creare e scommettere su app native di intelligenza artificiale.\",\n", 139 | " \"MyShellは、AIネイティブアプリの発見、作成、およびステーキングのための分散型かつ包括的なプラットフォームです。\",\n", 140 | " \"MyShell — это децентрализованная и всеобъемлющая платформа для обнаружения, создания и стейкинга AI-ориентированных приложений.\",\n", 141 | " \"MyShell هي منصة لامركزية وشاملة لاكتشاف وإنشاء ورهان تطبيقات الذكاء الاصطناعي الأصلية.\",\n", 142 | " \"MyShell是一个去中心化且全面的平台,用于发现、创建和投资AI原生应用程序。\",\n", 143 | " \"MyShell एक विकेंद्रीकृत और व्यापक मंच है, जो AI-मूल ऐप्स की खोज, सृजन और स्टेकिंग के लिए है।\",\n", 144 | " \"MyShell é uma plataforma descentralizada e abrangente para descobrir, criar e apostar em aplicativos nativos de IA.\"\n", 145 | "]\n", 146 | "src_path = f'{output_dir}/tmp.wav'\n", 147 | "\n", 148 | "for i, t in enumerate(text):\n", 149 | "\n", 150 | " response = client.audio.speech.create(\n", 151 | " model=\"tts-1\",\n", 152 | " voice=\"nova\",\n", 153 | " input=t,\n", 154 | " )\n", 155 | "\n", 156 | " response.stream_to_file(src_path)\n", 157 | "\n", 158 | " save_path = f'{output_dir}/output_crosslingual_{i}.wav'\n", 159 | "\n", 160 | " # Run the tone color converter\n", 161 | " encode_message = \"@MyShell\"\n", 
162 | " tone_color_converter.convert(\n", 163 | " audio_src_path=src_path, \n", 164 | " src_se=source_se, \n", 165 | " tgt_se=target_se, \n", 166 | " output_path=save_path,\n", 167 | " message=encode_message)" 168 | ] 169 | } 170 | ], 171 | "metadata": { 172 | "interpreter": { 173 | "hash": "9d70c38e1c0b038dbdffdaa4f8bfa1f6767c43760905c87a9fbe7800d18c6c35" 174 | }, 175 | "kernelspec": { 176 | "display_name": "Python 3 (ipykernel)", 177 | "language": "python", 178 | "name": "python3" 179 | }, 180 | "language_info": { 181 | "codemirror_mode": { 182 | "name": "ipython", 183 | "version": 3 184 | }, 185 | "file_extension": ".py", 186 | "mimetype": "text/x-python", 187 | "name": "python", 188 | "nbconvert_exporter": "python", 189 | "pygments_lexer": "ipython3", 190 | "version": "3.9.18" 191 | } 192 | }, 193 | "nbformat": 4, 194 | "nbformat_minor": 5 195 | } 196 | -------------------------------------------------------------------------------- /demo_part3.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Multi-Accent and Multi-Lingual Voice Clone Demo with MeloTTS" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "import os\n", 17 | "import torch\n", 18 | "from openvoice import se_extractor\n", 19 | "from openvoice.api import ToneColorConverter" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "### Initialization\n", 27 | "\n", 28 | "In this example, we will use the checkpoints from OpenVoiceV2. OpenVoiceV2 is trained with more aggressive augmentations and thus demonstrate better robustness in some cases." 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": null, 34 | "metadata": {}, 35 | "outputs": [], 36 | "source": [ 37 | "ckpt_converter = 'checkpoints_v2/converter'\n", 38 | "device = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\n", 39 | "output_dir = 'outputs_v2'\n", 40 | "\n", 41 | "tone_color_converter = ToneColorConverter(f'{ckpt_converter}/config.json', device=device)\n", 42 | "tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')\n", 43 | "\n", 44 | "os.makedirs(output_dir, exist_ok=True)" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "### Obtain Tone Color Embedding\n", 52 | "We only extract the tone color embedding for the target speaker. The source tone color embeddings can be directly loaded from `checkpoints_v2/ses` folder." 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": null, 58 | "metadata": {}, 59 | "outputs": [], 60 | "source": [ 61 | "\n", 62 | "reference_speaker = 'resources/example_reference.mp3' # This is the voice you want to clone\n", 63 | "target_se, audio_name = se_extractor.get_se(reference_speaker, tone_color_converter, vad=False)" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "#### Use MeloTTS as Base Speakers\n", 71 | "\n", 72 | "MeloTTS is a high-quality multi-lingual text-to-speech library by @MyShell.ai, supporting languages including English (American, British, Indian, Australian, Default), Spanish, French, Chinese, Japanese, Korean. In the following example, we will use the models in MeloTTS as the base speakers. 
" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": null, 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "from melo.api import TTS\n", 82 | "\n", 83 | "texts = {\n", 84 | " 'EN_NEWEST': \"Did you ever hear a folk tale about a giant turtle?\", # The newest English base speaker model\n", 85 | " 'EN': \"Did you ever hear a folk tale about a giant turtle?\",\n", 86 | " 'ES': \"El resplandor del sol acaricia las olas, pintando el cielo con una paleta deslumbrante.\",\n", 87 | " 'FR': \"La lueur dorée du soleil caresse les vagues, peignant le ciel d'une palette éblouissante.\",\n", 88 | " 'ZH': \"在这次vacation中,我们计划去Paris欣赏埃菲尔铁塔和卢浮宫的美景。\",\n", 89 | " 'JP': \"彼は毎朝ジョギングをして体を健康に保っています。\",\n", 90 | " 'KR': \"안녕하세요! 오늘은 날씨가 정말 좋네요.\",\n", 91 | "}\n", 92 | "\n", 93 | "\n", 94 | "src_path = f'{output_dir}/tmp.wav'\n", 95 | "\n", 96 | "# Speed is adjustable\n", 97 | "speed = 1.0\n", 98 | "\n", 99 | "for language, text in texts.items():\n", 100 | " model = TTS(language=language, device=device)\n", 101 | " speaker_ids = model.hps.data.spk2id\n", 102 | " \n", 103 | " for speaker_key in speaker_ids.keys():\n", 104 | " speaker_id = speaker_ids[speaker_key]\n", 105 | " speaker_key = speaker_key.lower().replace('_', '-')\n", 106 | " \n", 107 | " source_se = torch.load(f'checkpoints_v2/base_speakers/ses/{speaker_key}.pth', map_location=device)\n", 108 | " model.tts_to_file(text, speaker_id, src_path, speed=speed)\n", 109 | " save_path = f'{output_dir}/output_v2_{speaker_key}.wav'\n", 110 | "\n", 111 | " # Run the tone color converter\n", 112 | " encode_message = \"@MyShell\"\n", 113 | " tone_color_converter.convert(\n", 114 | " audio_src_path=src_path, \n", 115 | " src_se=source_se, \n", 116 | " tgt_se=target_se, \n", 117 | " output_path=save_path,\n", 118 | " message=encode_message)" 119 | ] 120 | } 121 | ], 122 | "metadata": { 123 | "kernelspec": { 124 | "display_name": "melo", 125 | "language": "python", 126 | "name": "python3" 127 | }, 128 | "language_info": { 129 | "codemirror_mode": { 130 | "name": "ipython", 131 | "version": 3 132 | }, 133 | "file_extension": ".py", 134 | "mimetype": "text/x-python", 135 | "name": "python", 136 | "nbconvert_exporter": "python", 137 | "pygments_lexer": "ipython3", 138 | "version": "3.9.18" 139 | } 140 | }, 141 | "nbformat": 4, 142 | "nbformat_minor": 2 143 | } 144 | -------------------------------------------------------------------------------- /docs/QA.md: -------------------------------------------------------------------------------- 1 | # Common Questions and Answers 2 | 3 | ## General Comments 4 | 5 | **OpenVoice is a Technology, not a Product** 6 | 7 | Although it works on a majority of voices if used correctly, please do not expect it to work perfectly on every case, as it takes a lot of engineering effort to translate a technology to a stable product. The targeted users of this technology are developers and researchers, not end users. End users expects a perfect product. However, we are confident to say that OpenVoice is the state-of-the-art among the source-available voice cloning technologies. 8 | 9 | The contribution of OpenVoice is a versatile instant voice cloning technical approach, not a ready-to-use perfect voice cloning product. However, we firmly believe that by releasing OpenVoice, we can accelerate the open research community's progress on instant voice cloning, and someday in the future the free voice cloning methods will be as good as commercial ones. 
10 | 11 | ## Issues with Voice Quality 12 | 13 | **Accent and Emotion of the Generated Voice are not Similar to the Reference Voice** 14 | 15 | First of all, OpenVoice only clones the tone color of the reference speaker. It does NOT clone the accent or emotion. The accent and emotion are controlled by the base speaker TTS model, not cloned by the tone color converter (please refer to our [paper](https://arxiv.org/pdf/2312.01479.pdf) for technical details). If you want to change the accent or emotion of the output, you need a base speaker model with that accent. OpenVoice provides sufficient flexibility for users to integrate their own base speaker model into the framework by simply replacing the base speaker we provide. 16 | 17 | **Bad Audio Quality of the Generated Speech** 18 | 19 | Please check the following: 20 | - Is your reference audio clean enough, without any background noise? You can find some high-quality reference speech [here](https://aiartes.com/voiceai). 21 | - Is your audio too short? 22 | - Does your audio contain speech from more than one person? 23 | - Does the reference audio contain long blank sections? 24 | - Did you give the reference audio the same name as one you used before but forget to delete the `processed` folder? 25 | 26 | ## Issues with Languages 27 | 28 | **Support of Other Languages** 29 | 30 | For multi-lingual and cross-lingual usage, please refer to [`demo_part2.ipynb`](https://github.com/myshell-ai/OpenVoice/blob/main/demo_part2.ipynb). OpenVoice supports any language as long as you have a base speaker in that language. The OpenVoice team has already done the most difficult part (tone color converter training) for you. A base speaker TTS model is relatively easy to train, and multiple existing open-source repositories support it. If you don't want to train one yourself, simply use the OpenAI TTS model as the base speaker. 31 | 32 | ## Issues with Installation 33 | **Error Related to Silero** 34 | 35 | When calling `get_vad_segments` from `se_extractor.py`, there should be a message like this: 36 | ``` 37 | Downloading: "https://github.com/snakers4/silero-vad/zipball/master" to /home/user/.cache/torch/hub/master.zip 38 | ``` 39 | The download will fail if your machine cannot access GitHub. Please download the zip from "https://github.com/snakers4/silero-vad/zipball/master" manually and unzip it to `/home/user/.cache/torch/hub/snakers4_silero-vad_master`. You can also see [this issue](https://github.com/myshell-ai/OpenVoice/issues/57) for solutions for other versions of Silero. 40 | -------------------------------------------------------------------------------- /docs/USAGE.md: -------------------------------------------------------------------------------- 1 | # Usage 2 | 3 | ## Table of Contents 4 | 5 | - [Quick Use](#quick-use): directly use OpenVoice without installation. 6 | - [Linux Install](#linux-install): for researchers and developers only. 7 | - [V1](#openvoice-v1) 8 | - [V2](#openvoice-v2) 9 | - [Install on Other Platforms](#install-on-other-platforms): unofficial installation guides contributed by the community 10 | 11 | ## Quick Use 12 | 13 | The input speech audio of OpenVoice can be in **Any Language**. OpenVoice can clone the voice in that speech audio and use it to speak in multiple languages.
For quick use, we recommend trying the already deployed services: 14 | 15 | - [British English](https://app.myshell.ai/widget/vYjqae) 16 | - [American English](https://app.myshell.ai/widget/nEFFJf) 17 | - [Indian English](https://app.myshell.ai/widget/V3iYze) 18 | - [Australian English](https://app.myshell.ai/widget/fM7JVf) 19 | - [Spanish](https://app.myshell.ai/widget/NNFFVz) 20 | - [French](https://app.myshell.ai/widget/z2uyUz) 21 | - [Chinese](https://app.myshell.ai/widget/fU7nUz) 22 | - [Japanese](https://app.myshell.ai/widget/IfIB3u) 23 | - [Korean](https://app.myshell.ai/widget/q6ZjIn) 24 | 25 | ## Minimal Demo 26 | 27 | For users who want to quickly try OpenVoice and do not require high quality or stability, click any of the following links: 28 | 29 |
30 | 31 | [Image links: `resources/tts-guide.png` and `resources/voice-clone-guide.png`] 32 | 33 |
34 | 35 | ## Linux Install 36 | 37 | This section is only for developers and researchers who are familiar with Linux, Python and PyTorch. Clone this repo, and run 38 | 39 | ``` 40 | conda create -n openvoice python=3.9 41 | conda activate openvoice 42 | git clone git@github.com:myshell-ai/OpenVoice.git 43 | cd OpenVoice 44 | pip install -e . 45 | ``` 46 | 47 | Whether you are using V1 or V2, the installation above is the same. 48 | 49 | ### OpenVoice V1 50 | 51 | Download the checkpoint from [here](https://myshell-public-repo-hosting.s3.amazonaws.com/openvoice/checkpoints_1226.zip) and extract it to the `checkpoints` folder. 52 | 53 | **1. Flexible Voice Style Control.** 54 | Please see [`demo_part1.ipynb`](../demo_part1.ipynb) for an example of how OpenVoice enables flexible style control over the cloned voice. 55 | 56 | **2. Cross-Lingual Voice Cloning.** 57 | Please see [`demo_part2.ipynb`](../demo_part2.ipynb) for an example with languages seen or unseen in the MSML training set. 58 | 59 | **3. Gradio Demo.** We provide a minimalist local Gradio demo here. We strongly suggest that users look into `demo_part1.ipynb`, `demo_part2.ipynb` and the [QnA](QA.md) if they run into issues with the Gradio demo. Launch a local Gradio demo with `python -m openvoice_app --share`. 60 | 61 | ### OpenVoice V2 62 | 63 | Download the checkpoint from [here](https://myshell-public-repo-hosting.s3.amazonaws.com/openvoice/checkpoints_v2_0417.zip) and extract it to the `checkpoints_v2` folder (a small Python download helper is sketched at the end of this page). 64 | 65 | Install [MeloTTS](https://github.com/myshell-ai/MeloTTS): 66 | ``` 67 | pip install git+https://github.com/myshell-ai/MeloTTS.git 68 | python -m unidic download 69 | ``` 70 | 71 | **Demo Usage.** Please see [`demo_part3.ipynb`](../demo_part3.ipynb) for example usage of OpenVoice V2. It natively supports English, Spanish, French, Chinese, Japanese and Korean. 72 | 73 | 74 | ## Install on Other Platforms 75 | 76 | This section provides unofficial installation guides contributed by open-source community members: 77 | 78 | - Windows 79 | - [Guide](https://github.com/Alienpups/OpenVoice/blob/main/docs/USAGE_WINDOWS.md) by [@Alienpups](https://github.com/Alienpups) 80 | - You are welcome to contribute if you have a better installation guide. We will list you here. 81 | - Docker 82 | - [Guide](https://github.com/StevenJSCF/OpenVoice/blob/update-docs/docs/DF_USAGE.md) by [@StevenJSCF](https://github.com/StevenJSCF) 83 | - You are welcome to contribute if you have a better installation guide. We will list you here.
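To complement the checkpoint download steps above, here is a minimal Python sketch that fetches and extracts both archives. The URLs and folder names are the ones given in this document; `wget` plus `unzip` work just as well, and you should verify afterwards that the extracted files end up under `checkpoints/` (V1) and `checkpoints_v2/` (V2) as expected.

```python
import urllib.request
import zipfile

# Checkpoint archives referenced in this document (V1 -> checkpoints, V2 -> checkpoints_v2).
CHECKPOINT_ZIPS = {
    "checkpoints": "https://myshell-public-repo-hosting.s3.amazonaws.com/openvoice/checkpoints_1226.zip",
    "checkpoints_v2": "https://myshell-public-repo-hosting.s3.amazonaws.com/openvoice/checkpoints_v2_0417.zip",
}

for folder, url in CHECKPOINT_ZIPS.items():
    zip_path = url.rsplit("/", 1)[-1]
    print(f"Downloading {url} ...")
    urllib.request.urlretrieve(url, zip_path)      # download the archive next to the repo
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(".")                    # verify the files land under the expected folder
```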
84 | -------------------------------------------------------------------------------- /openvoice/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ValyrianTech/OpenVoice_server/310f658a0db23852fa16dabc24401f86a4d9f199/openvoice/__init__.py -------------------------------------------------------------------------------- /openvoice/api.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import numpy as np 3 | import re 4 | import soundfile 5 | from openvoice import utils 6 | from openvoice import commons 7 | import os 8 | import librosa 9 | from openvoice.text import text_to_sequence 10 | from openvoice.mel_processing import spectrogram_torch 11 | from openvoice.models import SynthesizerTrn 12 | 13 | 14 | class OpenVoiceBaseClass(object): 15 | def __init__(self, 16 | config_path, 17 | device='cuda:0'): 18 | if 'cuda' in device: 19 | assert torch.cuda.is_available() 20 | 21 | hps = utils.get_hparams_from_file(config_path) 22 | 23 | model = SynthesizerTrn( 24 | len(getattr(hps, 'symbols', [])), 25 | hps.data.filter_length // 2 + 1, 26 | n_speakers=hps.data.n_speakers, 27 | **hps.model, 28 | ).to(device) 29 | 30 | model.eval() 31 | self.model = model 32 | self.hps = hps 33 | self.device = device 34 | 35 | def load_ckpt(self, ckpt_path): 36 | checkpoint_dict = torch.load(ckpt_path, map_location=torch.device(self.device)) 37 | a, b = self.model.load_state_dict(checkpoint_dict['model'], strict=False) 38 | print("Loaded checkpoint '{}'".format(ckpt_path)) 39 | print('missing/unexpected keys:', a, b) 40 | 41 | 42 | class BaseSpeakerTTS(OpenVoiceBaseClass): 43 | language_marks = { 44 | "english": "EN", 45 | "chinese": "ZH", 46 | } 47 | 48 | @staticmethod 49 | def get_text(text, hps, is_symbol): 50 | text_norm = text_to_sequence(text, hps.symbols, [] if is_symbol else hps.data.text_cleaners) 51 | if hps.data.add_blank: 52 | text_norm = commons.intersperse(text_norm, 0) 53 | text_norm = torch.LongTensor(text_norm) 54 | return text_norm 55 | 56 | @staticmethod 57 | def audio_numpy_concat(segment_data_list, sr, speed=1.): 58 | audio_segments = [] 59 | for segment_data in segment_data_list: 60 | audio_segments += segment_data.reshape(-1).tolist() 61 | audio_segments += [0] * int((sr * 0.05)/speed) 62 | audio_segments = np.array(audio_segments).astype(np.float32) 63 | return audio_segments 64 | 65 | @staticmethod 66 | def split_sentences_into_pieces(text, language_str): 67 | texts = utils.split_sentence(text, language_str=language_str) 68 | print(" > Text splitted to sentences.") 69 | print('\n'.join(texts)) 70 | print(" > ===========================") 71 | return texts 72 | 73 | def tts(self, text, output_path, speaker, language='English', speed=1.0): 74 | mark = self.language_marks.get(language.lower(), None) 75 | assert mark is not None, f"language {language} is not supported" 76 | 77 | texts = self.split_sentences_into_pieces(text, mark) 78 | 79 | audio_list = [] 80 | for t in texts: 81 | t = re.sub(r'([a-z])([A-Z])', r'\1 \2', t) 82 | t = f'[{mark}]{t}[{mark}]' 83 | stn_tst = self.get_text(t, self.hps, False) 84 | device = self.device 85 | speaker_id = self.hps.speakers[speaker] 86 | with torch.no_grad(): 87 | x_tst = stn_tst.unsqueeze(0).to(device) 88 | x_tst_lengths = torch.LongTensor([stn_tst.size(0)]).to(device) 89 | sid = torch.LongTensor([speaker_id]).to(device) 90 | audio = self.model.infer(x_tst, x_tst_lengths, sid=sid, noise_scale=0.667, noise_scale_w=0.6, 91 | 
length_scale=1.0 / speed)[0][0, 0].data.cpu().float().numpy() 92 | audio_list.append(audio) 93 | audio = self.audio_numpy_concat(audio_list, sr=self.hps.data.sampling_rate, speed=speed) 94 | 95 | if output_path is None: 96 | return audio 97 | else: 98 | soundfile.write(output_path, audio, self.hps.data.sampling_rate) 99 | 100 | 101 | class ToneColorConverter(OpenVoiceBaseClass): 102 | def __init__(self, *args, **kwargs): 103 | super().__init__(*args, **kwargs) 104 | 105 | if kwargs.get('enable_watermark', True): 106 | import wavmark 107 | self.watermark_model = wavmark.load_model().to(self.device) 108 | else: 109 | self.watermark_model = None 110 | self.version = getattr(self.hps, '_version_', "v1") 111 | 112 | 113 | 114 | def extract_se(self, ref_wav_list, se_save_path=None): 115 | if isinstance(ref_wav_list, str): 116 | ref_wav_list = [ref_wav_list] 117 | 118 | device = self.device 119 | hps = self.hps 120 | gs = [] 121 | 122 | for fname in ref_wav_list: 123 | audio_ref, sr = librosa.load(fname, sr=hps.data.sampling_rate) 124 | y = torch.FloatTensor(audio_ref) 125 | y = y.to(device) 126 | y = y.unsqueeze(0) 127 | y = spectrogram_torch(y, hps.data.filter_length, 128 | hps.data.sampling_rate, hps.data.hop_length, hps.data.win_length, 129 | center=False).to(device) 130 | with torch.no_grad(): 131 | g = self.model.ref_enc(y.transpose(1, 2)).unsqueeze(-1) 132 | gs.append(g.detach()) 133 | gs = torch.stack(gs).mean(0) 134 | 135 | if se_save_path is not None: 136 | os.makedirs(os.path.dirname(se_save_path), exist_ok=True) 137 | torch.save(gs.cpu(), se_save_path) 138 | 139 | return gs 140 | 141 | def convert(self, audio_src_path, src_se, tgt_se, output_path=None, tau=0.3, message="default"): 142 | hps = self.hps 143 | # load audio 144 | audio, sample_rate = librosa.load(audio_src_path, sr=hps.data.sampling_rate) 145 | audio = torch.tensor(audio).float() 146 | 147 | with torch.no_grad(): 148 | y = torch.FloatTensor(audio).to(self.device) 149 | y = y.unsqueeze(0) 150 | spec = spectrogram_torch(y, hps.data.filter_length, 151 | hps.data.sampling_rate, hps.data.hop_length, hps.data.win_length, 152 | center=False).to(self.device) 153 | spec_lengths = torch.LongTensor([spec.size(-1)]).to(self.device) 154 | audio = self.model.voice_conversion(spec, spec_lengths, sid_src=src_se, sid_tgt=tgt_se, tau=tau)[0][ 155 | 0, 0].data.cpu().float().numpy() 156 | audio = self.add_watermark(audio, message) 157 | if output_path is None: 158 | return audio 159 | else: 160 | soundfile.write(output_path, audio, hps.data.sampling_rate) 161 | 162 | def add_watermark(self, audio, message): 163 | if self.watermark_model is None: 164 | return audio 165 | device = self.device 166 | bits = utils.string_to_bits(message).reshape(-1) 167 | n_repeat = len(bits) // 32 168 | 169 | K = 16000 170 | coeff = 2 171 | for n in range(n_repeat): 172 | trunck = audio[(coeff * n) * K: (coeff * n + 1) * K] 173 | if len(trunck) != K: 174 | print('Audio too short, fail to add watermark') 175 | break 176 | message_npy = bits[n * 32: (n + 1) * 32] 177 | 178 | with torch.no_grad(): 179 | signal = torch.FloatTensor(trunck).to(device)[None] 180 | message_tensor = torch.FloatTensor(message_npy).to(device)[None] 181 | signal_wmd_tensor = self.watermark_model.encode(signal, message_tensor) 182 | signal_wmd_npy = signal_wmd_tensor.detach().cpu().squeeze() 183 | audio[(coeff * n) * K: (coeff * n + 1) * K] = signal_wmd_npy 184 | return audio 185 | 186 | def detect_watermark(self, audio, n_repeat): 187 | bits = [] 188 | K = 16000 189 | coeff = 2 190 | for n 
in range(n_repeat): 191 | trunck = audio[(coeff * n) * K: (coeff * n + 1) * K] 192 | if len(trunck) != K: 193 | print('Audio too short, fail to detect watermark') 194 | return 'Fail' 195 | with torch.no_grad(): 196 | signal = torch.FloatTensor(trunck).to(self.device).unsqueeze(0) 197 | message_decoded_npy = (self.watermark_model.decode(signal) >= 0.5).int().detach().cpu().numpy().squeeze() 198 | bits.append(message_decoded_npy) 199 | bits = np.stack(bits).reshape(-1, 8) 200 | message = utils.bits_to_string(bits) 201 | return message 202 | 203 | -------------------------------------------------------------------------------- /openvoice/attentions.py: -------------------------------------------------------------------------------- 1 | import math 2 | import torch 3 | from torch import nn 4 | from torch.nn import functional as F 5 | 6 | from openvoice import commons 7 | import logging 8 | 9 | logger = logging.getLogger(__name__) 10 | 11 | 12 | class LayerNorm(nn.Module): 13 | def __init__(self, channels, eps=1e-5): 14 | super().__init__() 15 | self.channels = channels 16 | self.eps = eps 17 | 18 | self.gamma = nn.Parameter(torch.ones(channels)) 19 | self.beta = nn.Parameter(torch.zeros(channels)) 20 | 21 | def forward(self, x): 22 | x = x.transpose(1, -1) 23 | x = F.layer_norm(x, (self.channels,), self.gamma, self.beta, self.eps) 24 | return x.transpose(1, -1) 25 | 26 | 27 | @torch.jit.script 28 | def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels): 29 | n_channels_int = n_channels[0] 30 | in_act = input_a + input_b 31 | t_act = torch.tanh(in_act[:, :n_channels_int, :]) 32 | s_act = torch.sigmoid(in_act[:, n_channels_int:, :]) 33 | acts = t_act * s_act 34 | return acts 35 | 36 | 37 | class Encoder(nn.Module): 38 | def __init__( 39 | self, 40 | hidden_channels, 41 | filter_channels, 42 | n_heads, 43 | n_layers, 44 | kernel_size=1, 45 | p_dropout=0.0, 46 | window_size=4, 47 | isflow=True, 48 | **kwargs 49 | ): 50 | super().__init__() 51 | self.hidden_channels = hidden_channels 52 | self.filter_channels = filter_channels 53 | self.n_heads = n_heads 54 | self.n_layers = n_layers 55 | self.kernel_size = kernel_size 56 | self.p_dropout = p_dropout 57 | self.window_size = window_size 58 | # if isflow: 59 | # cond_layer = torch.nn.Conv1d(256, 2*hidden_channels*n_layers, 1) 60 | # self.cond_pre = torch.nn.Conv1d(hidden_channels, 2*hidden_channels, 1) 61 | # self.cond_layer = weight_norm(cond_layer, name='weight') 62 | # self.gin_channels = 256 63 | self.cond_layer_idx = self.n_layers 64 | if "gin_channels" in kwargs: 65 | self.gin_channels = kwargs["gin_channels"] 66 | if self.gin_channels != 0: 67 | self.spk_emb_linear = nn.Linear(self.gin_channels, self.hidden_channels) 68 | # vits2 says 3rd block, so idx is 2 by default 69 | self.cond_layer_idx = ( 70 | kwargs["cond_layer_idx"] if "cond_layer_idx" in kwargs else 2 71 | ) 72 | # logging.debug(self.gin_channels, self.cond_layer_idx) 73 | assert ( 74 | self.cond_layer_idx < self.n_layers 75 | ), "cond_layer_idx should be less than n_layers" 76 | self.drop = nn.Dropout(p_dropout) 77 | self.attn_layers = nn.ModuleList() 78 | self.norm_layers_1 = nn.ModuleList() 79 | self.ffn_layers = nn.ModuleList() 80 | self.norm_layers_2 = nn.ModuleList() 81 | 82 | for i in range(self.n_layers): 83 | self.attn_layers.append( 84 | MultiHeadAttention( 85 | hidden_channels, 86 | hidden_channels, 87 | n_heads, 88 | p_dropout=p_dropout, 89 | window_size=window_size, 90 | ) 91 | ) 92 | self.norm_layers_1.append(LayerNorm(hidden_channels)) 93 | 
self.ffn_layers.append( 94 | FFN( 95 | hidden_channels, 96 | hidden_channels, 97 | filter_channels, 98 | kernel_size, 99 | p_dropout=p_dropout, 100 | ) 101 | ) 102 | self.norm_layers_2.append(LayerNorm(hidden_channels)) 103 | 104 | def forward(self, x, x_mask, g=None): 105 | attn_mask = x_mask.unsqueeze(2) * x_mask.unsqueeze(-1) 106 | x = x * x_mask 107 | for i in range(self.n_layers): 108 | if i == self.cond_layer_idx and g is not None: 109 | g = self.spk_emb_linear(g.transpose(1, 2)) 110 | g = g.transpose(1, 2) 111 | x = x + g 112 | x = x * x_mask 113 | y = self.attn_layers[i](x, x, attn_mask) 114 | y = self.drop(y) 115 | x = self.norm_layers_1[i](x + y) 116 | 117 | y = self.ffn_layers[i](x, x_mask) 118 | y = self.drop(y) 119 | x = self.norm_layers_2[i](x + y) 120 | x = x * x_mask 121 | return x 122 | 123 | 124 | class Decoder(nn.Module): 125 | def __init__( 126 | self, 127 | hidden_channels, 128 | filter_channels, 129 | n_heads, 130 | n_layers, 131 | kernel_size=1, 132 | p_dropout=0.0, 133 | proximal_bias=False, 134 | proximal_init=True, 135 | **kwargs 136 | ): 137 | super().__init__() 138 | self.hidden_channels = hidden_channels 139 | self.filter_channels = filter_channels 140 | self.n_heads = n_heads 141 | self.n_layers = n_layers 142 | self.kernel_size = kernel_size 143 | self.p_dropout = p_dropout 144 | self.proximal_bias = proximal_bias 145 | self.proximal_init = proximal_init 146 | 147 | self.drop = nn.Dropout(p_dropout) 148 | self.self_attn_layers = nn.ModuleList() 149 | self.norm_layers_0 = nn.ModuleList() 150 | self.encdec_attn_layers = nn.ModuleList() 151 | self.norm_layers_1 = nn.ModuleList() 152 | self.ffn_layers = nn.ModuleList() 153 | self.norm_layers_2 = nn.ModuleList() 154 | for i in range(self.n_layers): 155 | self.self_attn_layers.append( 156 | MultiHeadAttention( 157 | hidden_channels, 158 | hidden_channels, 159 | n_heads, 160 | p_dropout=p_dropout, 161 | proximal_bias=proximal_bias, 162 | proximal_init=proximal_init, 163 | ) 164 | ) 165 | self.norm_layers_0.append(LayerNorm(hidden_channels)) 166 | self.encdec_attn_layers.append( 167 | MultiHeadAttention( 168 | hidden_channels, hidden_channels, n_heads, p_dropout=p_dropout 169 | ) 170 | ) 171 | self.norm_layers_1.append(LayerNorm(hidden_channels)) 172 | self.ffn_layers.append( 173 | FFN( 174 | hidden_channels, 175 | hidden_channels, 176 | filter_channels, 177 | kernel_size, 178 | p_dropout=p_dropout, 179 | causal=True, 180 | ) 181 | ) 182 | self.norm_layers_2.append(LayerNorm(hidden_channels)) 183 | 184 | def forward(self, x, x_mask, h, h_mask): 185 | """ 186 | x: decoder input 187 | h: encoder output 188 | """ 189 | self_attn_mask = commons.subsequent_mask(x_mask.size(2)).to( 190 | device=x.device, dtype=x.dtype 191 | ) 192 | encdec_attn_mask = h_mask.unsqueeze(2) * x_mask.unsqueeze(-1) 193 | x = x * x_mask 194 | for i in range(self.n_layers): 195 | y = self.self_attn_layers[i](x, x, self_attn_mask) 196 | y = self.drop(y) 197 | x = self.norm_layers_0[i](x + y) 198 | 199 | y = self.encdec_attn_layers[i](x, h, encdec_attn_mask) 200 | y = self.drop(y) 201 | x = self.norm_layers_1[i](x + y) 202 | 203 | y = self.ffn_layers[i](x, x_mask) 204 | y = self.drop(y) 205 | x = self.norm_layers_2[i](x + y) 206 | x = x * x_mask 207 | return x 208 | 209 | 210 | class MultiHeadAttention(nn.Module): 211 | def __init__( 212 | self, 213 | channels, 214 | out_channels, 215 | n_heads, 216 | p_dropout=0.0, 217 | window_size=None, 218 | heads_share=True, 219 | block_length=None, 220 | proximal_bias=False, 221 | proximal_init=False, 222 | ): 
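As a quick sanity check of the relative-position transformer `Encoder` defined above, the sketch below pushes a dummy batch through it and confirms that the time dimension is preserved. All hyper-parameter values here are illustrative only and are not taken from any OpenVoice checkpoint config.

```python
import torch
from openvoice.attentions import Encoder

# Illustrative sizes only (a real model reads these from its config.json)
enc = Encoder(hidden_channels=192, filter_channels=768,
              n_heads=2, n_layers=6, kernel_size=3, p_dropout=0.1)

x = torch.randn(2, 192, 50)    # [batch, hidden_channels, frames]
x_mask = torch.ones(2, 1, 50)  # all frames are valid
y = enc(x, x_mask)
print(y.shape)                 # torch.Size([2, 192, 50])
```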
223 | super().__init__() 224 | assert channels % n_heads == 0 225 | 226 | self.channels = channels 227 | self.out_channels = out_channels 228 | self.n_heads = n_heads 229 | self.p_dropout = p_dropout 230 | self.window_size = window_size 231 | self.heads_share = heads_share 232 | self.block_length = block_length 233 | self.proximal_bias = proximal_bias 234 | self.proximal_init = proximal_init 235 | self.attn = None 236 | 237 | self.k_channels = channels // n_heads 238 | self.conv_q = nn.Conv1d(channels, channels, 1) 239 | self.conv_k = nn.Conv1d(channels, channels, 1) 240 | self.conv_v = nn.Conv1d(channels, channels, 1) 241 | self.conv_o = nn.Conv1d(channels, out_channels, 1) 242 | self.drop = nn.Dropout(p_dropout) 243 | 244 | if window_size is not None: 245 | n_heads_rel = 1 if heads_share else n_heads 246 | rel_stddev = self.k_channels**-0.5 247 | self.emb_rel_k = nn.Parameter( 248 | torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels) 249 | * rel_stddev 250 | ) 251 | self.emb_rel_v = nn.Parameter( 252 | torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels) 253 | * rel_stddev 254 | ) 255 | 256 | nn.init.xavier_uniform_(self.conv_q.weight) 257 | nn.init.xavier_uniform_(self.conv_k.weight) 258 | nn.init.xavier_uniform_(self.conv_v.weight) 259 | if proximal_init: 260 | with torch.no_grad(): 261 | self.conv_k.weight.copy_(self.conv_q.weight) 262 | self.conv_k.bias.copy_(self.conv_q.bias) 263 | 264 | def forward(self, x, c, attn_mask=None): 265 | q = self.conv_q(x) 266 | k = self.conv_k(c) 267 | v = self.conv_v(c) 268 | 269 | x, self.attn = self.attention(q, k, v, mask=attn_mask) 270 | 271 | x = self.conv_o(x) 272 | return x 273 | 274 | def attention(self, query, key, value, mask=None): 275 | # reshape [b, d, t] -> [b, n_h, t, d_k] 276 | b, d, t_s, t_t = (*key.size(), query.size(2)) 277 | query = query.view(b, self.n_heads, self.k_channels, t_t).transpose(2, 3) 278 | key = key.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3) 279 | value = value.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3) 280 | 281 | scores = torch.matmul(query / math.sqrt(self.k_channels), key.transpose(-2, -1)) 282 | if self.window_size is not None: 283 | assert ( 284 | t_s == t_t 285 | ), "Relative attention is only available for self-attention." 286 | key_relative_embeddings = self._get_relative_embeddings(self.emb_rel_k, t_s) 287 | rel_logits = self._matmul_with_relative_keys( 288 | query / math.sqrt(self.k_channels), key_relative_embeddings 289 | ) 290 | scores_local = self._relative_position_to_absolute_position(rel_logits) 291 | scores = scores + scores_local 292 | if self.proximal_bias: 293 | assert t_s == t_t, "Proximal bias is only available for self-attention." 294 | scores = scores + self._attention_bias_proximal(t_s).to( 295 | device=scores.device, dtype=scores.dtype 296 | ) 297 | if mask is not None: 298 | scores = scores.masked_fill(mask == 0, -1e4) 299 | if self.block_length is not None: 300 | assert ( 301 | t_s == t_t 302 | ), "Local attention is only available for self-attention." 
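The windowed relative-position attention implemented by `MultiHeadAttention` can also be exercised on its own; after a forward call the attention weights are kept in `self.attn`, which is handy when debugging. The shapes below are arbitrary examples, not values from a released model.

```python
import torch
from openvoice.attentions import MultiHeadAttention

attn = MultiHeadAttention(channels=192, out_channels=192, n_heads=2, window_size=4)
x = torch.randn(1, 192, 40)        # [batch, channels, frames]
mask = torch.ones(1, 1, 40, 40)    # full self-attention mask
out = attn(x, x, attn_mask=mask)   # query and context are the same tensor (self-attention)
print(out.shape)                   # torch.Size([1, 192, 40])
print(attn.attn.shape)             # torch.Size([1, 2, 40, 40]) – per-head attention weights
```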
303 | block_mask = ( 304 | torch.ones_like(scores) 305 | .triu(-self.block_length) 306 | .tril(self.block_length) 307 | ) 308 | scores = scores.masked_fill(block_mask == 0, -1e4) 309 | p_attn = F.softmax(scores, dim=-1) # [b, n_h, t_t, t_s] 310 | p_attn = self.drop(p_attn) 311 | output = torch.matmul(p_attn, value) 312 | if self.window_size is not None: 313 | relative_weights = self._absolute_position_to_relative_position(p_attn) 314 | value_relative_embeddings = self._get_relative_embeddings( 315 | self.emb_rel_v, t_s 316 | ) 317 | output = output + self._matmul_with_relative_values( 318 | relative_weights, value_relative_embeddings 319 | ) 320 | output = ( 321 | output.transpose(2, 3).contiguous().view(b, d, t_t) 322 | ) # [b, n_h, t_t, d_k] -> [b, d, t_t] 323 | return output, p_attn 324 | 325 | def _matmul_with_relative_values(self, x, y): 326 | """ 327 | x: [b, h, l, m] 328 | y: [h or 1, m, d] 329 | ret: [b, h, l, d] 330 | """ 331 | ret = torch.matmul(x, y.unsqueeze(0)) 332 | return ret 333 | 334 | def _matmul_with_relative_keys(self, x, y): 335 | """ 336 | x: [b, h, l, d] 337 | y: [h or 1, m, d] 338 | ret: [b, h, l, m] 339 | """ 340 | ret = torch.matmul(x, y.unsqueeze(0).transpose(-2, -1)) 341 | return ret 342 | 343 | def _get_relative_embeddings(self, relative_embeddings, length): 344 | 2 * self.window_size + 1 345 | # Pad first before slice to avoid using cond ops. 346 | pad_length = max(length - (self.window_size + 1), 0) 347 | slice_start_position = max((self.window_size + 1) - length, 0) 348 | slice_end_position = slice_start_position + 2 * length - 1 349 | if pad_length > 0: 350 | padded_relative_embeddings = F.pad( 351 | relative_embeddings, 352 | commons.convert_pad_shape([[0, 0], [pad_length, pad_length], [0, 0]]), 353 | ) 354 | else: 355 | padded_relative_embeddings = relative_embeddings 356 | used_relative_embeddings = padded_relative_embeddings[ 357 | :, slice_start_position:slice_end_position 358 | ] 359 | return used_relative_embeddings 360 | 361 | def _relative_position_to_absolute_position(self, x): 362 | """ 363 | x: [b, h, l, 2*l-1] 364 | ret: [b, h, l, l] 365 | """ 366 | batch, heads, length, _ = x.size() 367 | # Concat columns of pad to shift from relative to absolute indexing. 368 | x = F.pad(x, commons.convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, 1]])) 369 | 370 | # Concat extra elements so to add up to shape (len+1, 2*len-1). 371 | x_flat = x.view([batch, heads, length * 2 * length]) 372 | x_flat = F.pad( 373 | x_flat, commons.convert_pad_shape([[0, 0], [0, 0], [0, length - 1]]) 374 | ) 375 | 376 | # Reshape and slice out the padded elements. 377 | x_final = x_flat.view([batch, heads, length + 1, 2 * length - 1])[ 378 | :, :, :length, length - 1 : 379 | ] 380 | return x_final 381 | 382 | def _absolute_position_to_relative_position(self, x): 383 | """ 384 | x: [b, h, l, l] 385 | ret: [b, h, l, 2*l-1] 386 | """ 387 | batch, heads, length, _ = x.size() 388 | # pad along column 389 | x = F.pad( 390 | x, commons.convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, length - 1]]) 391 | ) 392 | x_flat = x.view([batch, heads, length**2 + length * (length - 1)]) 393 | # add 0's in the beginning that will skew the elements after reshape 394 | x_flat = F.pad(x_flat, commons.convert_pad_shape([[0, 0], [0, 0], [length, 0]])) 395 | x_final = x_flat.view([batch, heads, length, 2 * length])[:, :, :, 1:] 396 | return x_final 397 | 398 | def _attention_bias_proximal(self, length): 399 | """Bias for self-attention to encourage attention to close positions. 
400 | Args: 401 | length: an integer scalar. 402 | Returns: 403 | a Tensor with shape [1, 1, length, length] 404 | """ 405 | r = torch.arange(length, dtype=torch.float32) 406 | diff = torch.unsqueeze(r, 0) - torch.unsqueeze(r, 1) 407 | return torch.unsqueeze(torch.unsqueeze(-torch.log1p(torch.abs(diff)), 0), 0) 408 | 409 | 410 | class FFN(nn.Module): 411 | def __init__( 412 | self, 413 | in_channels, 414 | out_channels, 415 | filter_channels, 416 | kernel_size, 417 | p_dropout=0.0, 418 | activation=None, 419 | causal=False, 420 | ): 421 | super().__init__() 422 | self.in_channels = in_channels 423 | self.out_channels = out_channels 424 | self.filter_channels = filter_channels 425 | self.kernel_size = kernel_size 426 | self.p_dropout = p_dropout 427 | self.activation = activation 428 | self.causal = causal 429 | 430 | if causal: 431 | self.padding = self._causal_padding 432 | else: 433 | self.padding = self._same_padding 434 | 435 | self.conv_1 = nn.Conv1d(in_channels, filter_channels, kernel_size) 436 | self.conv_2 = nn.Conv1d(filter_channels, out_channels, kernel_size) 437 | self.drop = nn.Dropout(p_dropout) 438 | 439 | def forward(self, x, x_mask): 440 | x = self.conv_1(self.padding(x * x_mask)) 441 | if self.activation == "gelu": 442 | x = x * torch.sigmoid(1.702 * x) 443 | else: 444 | x = torch.relu(x) 445 | x = self.drop(x) 446 | x = self.conv_2(self.padding(x * x_mask)) 447 | return x * x_mask 448 | 449 | def _causal_padding(self, x): 450 | if self.kernel_size == 1: 451 | return x 452 | pad_l = self.kernel_size - 1 453 | pad_r = 0 454 | padding = [[0, 0], [0, 0], [pad_l, pad_r]] 455 | x = F.pad(x, commons.convert_pad_shape(padding)) 456 | return x 457 | 458 | def _same_padding(self, x): 459 | if self.kernel_size == 1: 460 | return x 461 | pad_l = (self.kernel_size - 1) // 2 462 | pad_r = self.kernel_size // 2 463 | padding = [[0, 0], [0, 0], [pad_l, pad_r]] 464 | x = F.pad(x, commons.convert_pad_shape(padding)) 465 | return x 466 | -------------------------------------------------------------------------------- /openvoice/commons.py: -------------------------------------------------------------------------------- 1 | import math 2 | import torch 3 | from torch.nn import functional as F 4 | 5 | 6 | def init_weights(m, mean=0.0, std=0.01): 7 | classname = m.__class__.__name__ 8 | if classname.find("Conv") != -1: 9 | m.weight.data.normal_(mean, std) 10 | 11 | 12 | def get_padding(kernel_size, dilation=1): 13 | return int((kernel_size * dilation - dilation) / 2) 14 | 15 | 16 | def convert_pad_shape(pad_shape): 17 | layer = pad_shape[::-1] 18 | pad_shape = [item for sublist in layer for item in sublist] 19 | return pad_shape 20 | 21 | 22 | def intersperse(lst, item): 23 | result = [item] * (len(lst) * 2 + 1) 24 | result[1::2] = lst 25 | return result 26 | 27 | 28 | def kl_divergence(m_p, logs_p, m_q, logs_q): 29 | """KL(P||Q)""" 30 | kl = (logs_q - logs_p) - 0.5 31 | kl += ( 32 | 0.5 * (torch.exp(2.0 * logs_p) + ((m_p - m_q) ** 2)) * torch.exp(-2.0 * logs_q) 33 | ) 34 | return kl 35 | 36 | 37 | def rand_gumbel(shape): 38 | """Sample from the Gumbel distribution, protect from overflows.""" 39 | uniform_samples = torch.rand(shape) * 0.99998 + 0.00001 40 | return -torch.log(-torch.log(uniform_samples)) 41 | 42 | 43 | def rand_gumbel_like(x): 44 | g = rand_gumbel(x.size()).to(dtype=x.dtype, device=x.device) 45 | return g 46 | 47 | 48 | def slice_segments(x, ids_str, segment_size=4): 49 | ret = torch.zeros_like(x[:, :, :segment_size]) 50 | for i in range(x.size(0)): 51 | idx_str = ids_str[i] 
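A few of the helpers in `commons.py` are easiest to understand from their outputs. The toy values below are purely illustrative; `intersperse` is the routine typically used to weave a blank token id between symbol ids before text encoding.

```python
import torch
from openvoice import commons

print(commons.intersperse([1, 2, 3], 0))
# [0, 1, 0, 2, 0, 3, 0] – a blank token id between (and around) the symbol ids

lengths = torch.tensor([3, 5])
print(commons.sequence_mask(lengths, 5))
# tensor([[ True,  True,  True, False, False],
#         [ True,  True,  True,  True,  True]])

x = torch.arange(16, dtype=torch.float).view(2, 1, 8)   # [batch, channels, time]
print(commons.slice_segments(x, ids_str=torch.tensor([0, 3]), segment_size=4))
# tensor([[[ 0.,  1.,  2.,  3.]],
#         [[11., 12., 13., 14.]]])
```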
52 | idx_end = idx_str + segment_size 53 | ret[i] = x[i, :, idx_str:idx_end] 54 | return ret 55 | 56 | 57 | def rand_slice_segments(x, x_lengths=None, segment_size=4): 58 | b, d, t = x.size() 59 | if x_lengths is None: 60 | x_lengths = t 61 | ids_str_max = x_lengths - segment_size + 1 62 | ids_str = (torch.rand([b]).to(device=x.device) * ids_str_max).to(dtype=torch.long) 63 | ret = slice_segments(x, ids_str, segment_size) 64 | return ret, ids_str 65 | 66 | 67 | def get_timing_signal_1d(length, channels, min_timescale=1.0, max_timescale=1.0e4): 68 | position = torch.arange(length, dtype=torch.float) 69 | num_timescales = channels // 2 70 | log_timescale_increment = math.log(float(max_timescale) / float(min_timescale)) / ( 71 | num_timescales - 1 72 | ) 73 | inv_timescales = min_timescale * torch.exp( 74 | torch.arange(num_timescales, dtype=torch.float) * -log_timescale_increment 75 | ) 76 | scaled_time = position.unsqueeze(0) * inv_timescales.unsqueeze(1) 77 | signal = torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], 0) 78 | signal = F.pad(signal, [0, 0, 0, channels % 2]) 79 | signal = signal.view(1, channels, length) 80 | return signal 81 | 82 | 83 | def add_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4): 84 | b, channels, length = x.size() 85 | signal = get_timing_signal_1d(length, channels, min_timescale, max_timescale) 86 | return x + signal.to(dtype=x.dtype, device=x.device) 87 | 88 | 89 | def cat_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4, axis=1): 90 | b, channels, length = x.size() 91 | signal = get_timing_signal_1d(length, channels, min_timescale, max_timescale) 92 | return torch.cat([x, signal.to(dtype=x.dtype, device=x.device)], axis) 93 | 94 | 95 | def subsequent_mask(length): 96 | mask = torch.tril(torch.ones(length, length)).unsqueeze(0).unsqueeze(0) 97 | return mask 98 | 99 | 100 | @torch.jit.script 101 | def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels): 102 | n_channels_int = n_channels[0] 103 | in_act = input_a + input_b 104 | t_act = torch.tanh(in_act[:, :n_channels_int, :]) 105 | s_act = torch.sigmoid(in_act[:, n_channels_int:, :]) 106 | acts = t_act * s_act 107 | return acts 108 | 109 | 110 | def convert_pad_shape(pad_shape): 111 | layer = pad_shape[::-1] 112 | pad_shape = [item for sublist in layer for item in sublist] 113 | return pad_shape 114 | 115 | 116 | def shift_1d(x): 117 | x = F.pad(x, convert_pad_shape([[0, 0], [0, 0], [1, 0]]))[:, :, :-1] 118 | return x 119 | 120 | 121 | def sequence_mask(length, max_length=None): 122 | if max_length is None: 123 | max_length = length.max() 124 | x = torch.arange(max_length, dtype=length.dtype, device=length.device) 125 | return x.unsqueeze(0) < length.unsqueeze(1) 126 | 127 | 128 | def generate_path(duration, mask): 129 | """ 130 | duration: [b, 1, t_x] 131 | mask: [b, 1, t_y, t_x] 132 | """ 133 | 134 | b, _, t_y, t_x = mask.shape 135 | cum_duration = torch.cumsum(duration, -1) 136 | 137 | cum_duration_flat = cum_duration.view(b * t_x) 138 | path = sequence_mask(cum_duration_flat, t_y).to(mask.dtype) 139 | path = path.view(b, t_x, t_y) 140 | path = path - F.pad(path, convert_pad_shape([[0, 0], [1, 0], [0, 0]]))[:, :-1] 141 | path = path.unsqueeze(1).transpose(2, 3) * mask 142 | return path 143 | 144 | 145 | def clip_grad_value_(parameters, clip_value, norm_type=2): 146 | if isinstance(parameters, torch.Tensor): 147 | parameters = [parameters] 148 | parameters = list(filter(lambda p: p.grad is not None, parameters)) 149 | norm_type = float(norm_type) 150 | if clip_value 
is not None: 151 | clip_value = float(clip_value) 152 | 153 | total_norm = 0 154 | for p in parameters: 155 | param_norm = p.grad.data.norm(norm_type) 156 | total_norm += param_norm.item() ** norm_type 157 | if clip_value is not None: 158 | p.grad.data.clamp_(min=-clip_value, max=clip_value) 159 | total_norm = total_norm ** (1.0 / norm_type) 160 | return total_norm 161 | -------------------------------------------------------------------------------- /openvoice/mel_processing.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.utils.data 3 | from librosa.filters import mel as librosa_mel_fn 4 | 5 | MAX_WAV_VALUE = 32768.0 6 | 7 | 8 | def dynamic_range_compression_torch(x, C=1, clip_val=1e-5): 9 | """ 10 | PARAMS 11 | ------ 12 | C: compression factor 13 | """ 14 | return torch.log(torch.clamp(x, min=clip_val) * C) 15 | 16 | 17 | def dynamic_range_decompression_torch(x, C=1): 18 | """ 19 | PARAMS 20 | ------ 21 | C: compression factor used to compress 22 | """ 23 | return torch.exp(x) / C 24 | 25 | 26 | def spectral_normalize_torch(magnitudes): 27 | output = dynamic_range_compression_torch(magnitudes) 28 | return output 29 | 30 | 31 | def spectral_de_normalize_torch(magnitudes): 32 | output = dynamic_range_decompression_torch(magnitudes) 33 | return output 34 | 35 | 36 | mel_basis = {} 37 | hann_window = {} 38 | 39 | 40 | def spectrogram_torch(y, n_fft, sampling_rate, hop_size, win_size, center=False): 41 | if torch.min(y) < -1.1: 42 | print("min value is ", torch.min(y)) 43 | if torch.max(y) > 1.1: 44 | print("max value is ", torch.max(y)) 45 | 46 | global hann_window 47 | dtype_device = str(y.dtype) + "_" + str(y.device) 48 | wnsize_dtype_device = str(win_size) + "_" + dtype_device 49 | if wnsize_dtype_device not in hann_window: 50 | hann_window[wnsize_dtype_device] = torch.hann_window(win_size).to( 51 | dtype=y.dtype, device=y.device 52 | ) 53 | 54 | y = torch.nn.functional.pad( 55 | y.unsqueeze(1), 56 | (int((n_fft - hop_size) / 2), int((n_fft - hop_size) / 2)), 57 | mode="reflect", 58 | ) 59 | y = y.squeeze(1) 60 | 61 | spec = torch.stft( 62 | y, 63 | n_fft, 64 | hop_length=hop_size, 65 | win_length=win_size, 66 | window=hann_window[wnsize_dtype_device], 67 | center=center, 68 | pad_mode="reflect", 69 | normalized=False, 70 | onesided=True, 71 | return_complex=False, 72 | ) 73 | 74 | spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6) 75 | return spec 76 | 77 | 78 | def spectrogram_torch_conv(y, n_fft, sampling_rate, hop_size, win_size, center=False): 79 | # if torch.min(y) < -1.: 80 | # print('min value is ', torch.min(y)) 81 | # if torch.max(y) > 1.: 82 | # print('max value is ', torch.max(y)) 83 | 84 | global hann_window 85 | dtype_device = str(y.dtype) + '_' + str(y.device) 86 | wnsize_dtype_device = str(win_size) + '_' + dtype_device 87 | if wnsize_dtype_device not in hann_window: 88 | hann_window[wnsize_dtype_device] = torch.hann_window(win_size).to(dtype=y.dtype, device=y.device) 89 | 90 | y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect') 91 | 92 | # ******************** original ************************# 93 | # y = y.squeeze(1) 94 | # spec1 = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[wnsize_dtype_device], 95 | # center=center, pad_mode='reflect', normalized=False, onesided=True, return_complex=False) 96 | 97 | # ******************** ConvSTFT ************************# 98 | freq_cutoff = n_fft // 2 + 1 99 | 
fourier_basis = torch.view_as_real(torch.fft.fft(torch.eye(n_fft))) 100 | forward_basis = fourier_basis[:freq_cutoff].permute(2, 0, 1).reshape(-1, 1, fourier_basis.shape[1]) 101 | forward_basis = forward_basis * torch.as_tensor(librosa.util.pad_center(torch.hann_window(win_size), size=n_fft)).float() 102 | 103 | import torch.nn.functional as F 104 | 105 | # if center: 106 | # signal = F.pad(y[:, None, None, :], (n_fft // 2, n_fft // 2, 0, 0), mode = 'reflect').squeeze(1) 107 | assert center is False 108 | 109 | forward_transform_squared = F.conv1d(y, forward_basis.to(y.device), stride = hop_size) 110 | spec2 = torch.stack([forward_transform_squared[:, :freq_cutoff, :], forward_transform_squared[:, freq_cutoff:, :]], dim = -1) 111 | 112 | 113 | # ******************** Verification ************************# 114 | spec1 = torch.stft(y.squeeze(1), n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[wnsize_dtype_device], 115 | center=center, pad_mode='reflect', normalized=False, onesided=True, return_complex=False) 116 | assert torch.allclose(spec1, spec2, atol=1e-4) 117 | 118 | spec = torch.sqrt(spec2.pow(2).sum(-1) + 1e-6) 119 | return spec 120 | 121 | 122 | def spec_to_mel_torch(spec, n_fft, num_mels, sampling_rate, fmin, fmax): 123 | global mel_basis 124 | dtype_device = str(spec.dtype) + "_" + str(spec.device) 125 | fmax_dtype_device = str(fmax) + "_" + dtype_device 126 | if fmax_dtype_device not in mel_basis: 127 | mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax) 128 | mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to( 129 | dtype=spec.dtype, device=spec.device 130 | ) 131 | spec = torch.matmul(mel_basis[fmax_dtype_device], spec) 132 | spec = spectral_normalize_torch(spec) 133 | return spec 134 | 135 | 136 | def mel_spectrogram_torch( 137 | y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False 138 | ): 139 | if torch.min(y) < -1.0: 140 | print("min value is ", torch.min(y)) 141 | if torch.max(y) > 1.0: 142 | print("max value is ", torch.max(y)) 143 | 144 | global mel_basis, hann_window 145 | dtype_device = str(y.dtype) + "_" + str(y.device) 146 | fmax_dtype_device = str(fmax) + "_" + dtype_device 147 | wnsize_dtype_device = str(win_size) + "_" + dtype_device 148 | if fmax_dtype_device not in mel_basis: 149 | mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax) 150 | mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to( 151 | dtype=y.dtype, device=y.device 152 | ) 153 | if wnsize_dtype_device not in hann_window: 154 | hann_window[wnsize_dtype_device] = torch.hann_window(win_size).to( 155 | dtype=y.dtype, device=y.device 156 | ) 157 | 158 | y = torch.nn.functional.pad( 159 | y.unsqueeze(1), 160 | (int((n_fft - hop_size) / 2), int((n_fft - hop_size) / 2)), 161 | mode="reflect", 162 | ) 163 | y = y.squeeze(1) 164 | 165 | spec = torch.stft( 166 | y, 167 | n_fft, 168 | hop_length=hop_size, 169 | win_length=win_size, 170 | window=hann_window[wnsize_dtype_device], 171 | center=center, 172 | pad_mode="reflect", 173 | normalized=False, 174 | onesided=True, 175 | return_complex=False, 176 | ) 177 | 178 | spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6) 179 | 180 | spec = torch.matmul(mel_basis[fmax_dtype_device], spec) 181 | spec = spectral_normalize_torch(spec) 182 | 183 | return spec -------------------------------------------------------------------------------- /openvoice/models.py: -------------------------------------------------------------------------------- 1 | import math 2 | import torch 3 | from torch import nn 4 
| from torch.nn import functional as F 5 | 6 | from openvoice import commons 7 | from openvoice import modules 8 | from openvoice import attentions 9 | 10 | from torch.nn import Conv1d, ConvTranspose1d, Conv2d 11 | from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm 12 | 13 | from openvoice.commons import init_weights, get_padding 14 | 15 | 16 | class TextEncoder(nn.Module): 17 | def __init__(self, 18 | n_vocab, 19 | out_channels, 20 | hidden_channels, 21 | filter_channels, 22 | n_heads, 23 | n_layers, 24 | kernel_size, 25 | p_dropout): 26 | super().__init__() 27 | self.n_vocab = n_vocab 28 | self.out_channels = out_channels 29 | self.hidden_channels = hidden_channels 30 | self.filter_channels = filter_channels 31 | self.n_heads = n_heads 32 | self.n_layers = n_layers 33 | self.kernel_size = kernel_size 34 | self.p_dropout = p_dropout 35 | 36 | self.emb = nn.Embedding(n_vocab, hidden_channels) 37 | nn.init.normal_(self.emb.weight, 0.0, hidden_channels**-0.5) 38 | 39 | self.encoder = attentions.Encoder( 40 | hidden_channels, 41 | filter_channels, 42 | n_heads, 43 | n_layers, 44 | kernel_size, 45 | p_dropout) 46 | self.proj= nn.Conv1d(hidden_channels, out_channels * 2, 1) 47 | 48 | def forward(self, x, x_lengths): 49 | x = self.emb(x) * math.sqrt(self.hidden_channels) # [b, t, h] 50 | x = torch.transpose(x, 1, -1) # [b, h, t] 51 | x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype) 52 | 53 | x = self.encoder(x * x_mask, x_mask) 54 | stats = self.proj(x) * x_mask 55 | 56 | m, logs = torch.split(stats, self.out_channels, dim=1) 57 | return x, m, logs, x_mask 58 | 59 | 60 | class DurationPredictor(nn.Module): 61 | def __init__( 62 | self, in_channels, filter_channels, kernel_size, p_dropout, gin_channels=0 63 | ): 64 | super().__init__() 65 | 66 | self.in_channels = in_channels 67 | self.filter_channels = filter_channels 68 | self.kernel_size = kernel_size 69 | self.p_dropout = p_dropout 70 | self.gin_channels = gin_channels 71 | 72 | self.drop = nn.Dropout(p_dropout) 73 | self.conv_1 = nn.Conv1d( 74 | in_channels, filter_channels, kernel_size, padding=kernel_size // 2 75 | ) 76 | self.norm_1 = modules.LayerNorm(filter_channels) 77 | self.conv_2 = nn.Conv1d( 78 | filter_channels, filter_channels, kernel_size, padding=kernel_size // 2 79 | ) 80 | self.norm_2 = modules.LayerNorm(filter_channels) 81 | self.proj = nn.Conv1d(filter_channels, 1, 1) 82 | 83 | if gin_channels != 0: 84 | self.cond = nn.Conv1d(gin_channels, in_channels, 1) 85 | 86 | def forward(self, x, x_mask, g=None): 87 | x = torch.detach(x) 88 | if g is not None: 89 | g = torch.detach(g) 90 | x = x + self.cond(g) 91 | x = self.conv_1(x * x_mask) 92 | x = torch.relu(x) 93 | x = self.norm_1(x) 94 | x = self.drop(x) 95 | x = self.conv_2(x * x_mask) 96 | x = torch.relu(x) 97 | x = self.norm_2(x) 98 | x = self.drop(x) 99 | x = self.proj(x * x_mask) 100 | return x * x_mask 101 | 102 | class StochasticDurationPredictor(nn.Module): 103 | def __init__(self, in_channels, filter_channels, kernel_size, p_dropout, n_flows=4, gin_channels=0): 104 | super().__init__() 105 | filter_channels = in_channels # it needs to be removed from future version. 
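The `TextEncoder` above maps symbol ids to the prior statistics `(m, logs)` consumed later by the flow. A minimal shape check, using made-up hyper-parameters rather than values from a released config:

```python
import torch
from openvoice.models import TextEncoder

enc_p = TextEncoder(n_vocab=100, out_channels=192, hidden_channels=192,
                    filter_channels=768, n_heads=2, n_layers=6,
                    kernel_size=3, p_dropout=0.1)

x = torch.randint(0, 100, (1, 20))   # [batch, symbol ids]
x_lengths = torch.tensor([20])
h, m, logs, x_mask = enc_p(x, x_lengths)
print(h.shape, m.shape, logs.shape, x_mask.shape)
# torch.Size([1, 192, 20]) torch.Size([1, 192, 20]) torch.Size([1, 192, 20]) torch.Size([1, 1, 20])
```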
106 | self.in_channels = in_channels 107 | self.filter_channels = filter_channels 108 | self.kernel_size = kernel_size 109 | self.p_dropout = p_dropout 110 | self.n_flows = n_flows 111 | self.gin_channels = gin_channels 112 | 113 | self.log_flow = modules.Log() 114 | self.flows = nn.ModuleList() 115 | self.flows.append(modules.ElementwiseAffine(2)) 116 | for i in range(n_flows): 117 | self.flows.append(modules.ConvFlow(2, filter_channels, kernel_size, n_layers=3)) 118 | self.flows.append(modules.Flip()) 119 | 120 | self.post_pre = nn.Conv1d(1, filter_channels, 1) 121 | self.post_proj = nn.Conv1d(filter_channels, filter_channels, 1) 122 | self.post_convs = modules.DDSConv(filter_channels, kernel_size, n_layers=3, p_dropout=p_dropout) 123 | self.post_flows = nn.ModuleList() 124 | self.post_flows.append(modules.ElementwiseAffine(2)) 125 | for i in range(4): 126 | self.post_flows.append(modules.ConvFlow(2, filter_channels, kernel_size, n_layers=3)) 127 | self.post_flows.append(modules.Flip()) 128 | 129 | self.pre = nn.Conv1d(in_channels, filter_channels, 1) 130 | self.proj = nn.Conv1d(filter_channels, filter_channels, 1) 131 | self.convs = modules.DDSConv(filter_channels, kernel_size, n_layers=3, p_dropout=p_dropout) 132 | if gin_channels != 0: 133 | self.cond = nn.Conv1d(gin_channels, filter_channels, 1) 134 | 135 | def forward(self, x, x_mask, w=None, g=None, reverse=False, noise_scale=1.0): 136 | x = torch.detach(x) 137 | x = self.pre(x) 138 | if g is not None: 139 | g = torch.detach(g) 140 | x = x + self.cond(g) 141 | x = self.convs(x, x_mask) 142 | x = self.proj(x) * x_mask 143 | 144 | if not reverse: 145 | flows = self.flows 146 | assert w is not None 147 | 148 | logdet_tot_q = 0 149 | h_w = self.post_pre(w) 150 | h_w = self.post_convs(h_w, x_mask) 151 | h_w = self.post_proj(h_w) * x_mask 152 | e_q = torch.randn(w.size(0), 2, w.size(2)).to(device=x.device, dtype=x.dtype) * x_mask 153 | z_q = e_q 154 | for flow in self.post_flows: 155 | z_q, logdet_q = flow(z_q, x_mask, g=(x + h_w)) 156 | logdet_tot_q += logdet_q 157 | z_u, z1 = torch.split(z_q, [1, 1], 1) 158 | u = torch.sigmoid(z_u) * x_mask 159 | z0 = (w - u) * x_mask 160 | logdet_tot_q += torch.sum((F.logsigmoid(z_u) + F.logsigmoid(-z_u)) * x_mask, [1,2]) 161 | logq = torch.sum(-0.5 * (math.log(2*math.pi) + (e_q**2)) * x_mask, [1,2]) - logdet_tot_q 162 | 163 | logdet_tot = 0 164 | z0, logdet = self.log_flow(z0, x_mask) 165 | logdet_tot += logdet 166 | z = torch.cat([z0, z1], 1) 167 | for flow in flows: 168 | z, logdet = flow(z, x_mask, g=x, reverse=reverse) 169 | logdet_tot = logdet_tot + logdet 170 | nll = torch.sum(0.5 * (math.log(2*math.pi) + (z**2)) * x_mask, [1,2]) - logdet_tot 171 | return nll + logq # [b] 172 | else: 173 | flows = list(reversed(self.flows)) 174 | flows = flows[:-2] + [flows[-1]] # remove a useless vflow 175 | z = torch.randn(x.size(0), 2, x.size(2)).to(device=x.device, dtype=x.dtype) * noise_scale 176 | for flow in flows: 177 | z = flow(z, x_mask, g=x, reverse=reverse) 178 | z0, z1 = torch.split(z, [1, 1], 1) 179 | logw = z0 180 | return logw 181 | 182 | class PosteriorEncoder(nn.Module): 183 | def __init__( 184 | self, 185 | in_channels, 186 | out_channels, 187 | hidden_channels, 188 | kernel_size, 189 | dilation_rate, 190 | n_layers, 191 | gin_channels=0, 192 | ): 193 | super().__init__() 194 | self.in_channels = in_channels 195 | self.out_channels = out_channels 196 | self.hidden_channels = hidden_channels 197 | self.kernel_size = kernel_size 198 | self.dilation_rate = dilation_rate 199 | self.n_layers = 
n_layers 200 | self.gin_channels = gin_channels 201 | 202 | self.pre = nn.Conv1d(in_channels, hidden_channels, 1) 203 | self.enc = modules.WN( 204 | hidden_channels, 205 | kernel_size, 206 | dilation_rate, 207 | n_layers, 208 | gin_channels=gin_channels, 209 | ) 210 | self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1) 211 | 212 | def forward(self, x, x_lengths, g=None, tau=1.0): 213 | x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to( 214 | x.dtype 215 | ) 216 | x = self.pre(x) * x_mask 217 | x = self.enc(x, x_mask, g=g) 218 | stats = self.proj(x) * x_mask 219 | m, logs = torch.split(stats, self.out_channels, dim=1) 220 | z = (m + torch.randn_like(m) * tau * torch.exp(logs)) * x_mask 221 | return z, m, logs, x_mask 222 | 223 | 224 | class Generator(torch.nn.Module): 225 | def __init__( 226 | self, 227 | initial_channel, 228 | resblock, 229 | resblock_kernel_sizes, 230 | resblock_dilation_sizes, 231 | upsample_rates, 232 | upsample_initial_channel, 233 | upsample_kernel_sizes, 234 | gin_channels=0, 235 | ): 236 | super(Generator, self).__init__() 237 | self.num_kernels = len(resblock_kernel_sizes) 238 | self.num_upsamples = len(upsample_rates) 239 | self.conv_pre = Conv1d( 240 | initial_channel, upsample_initial_channel, 7, 1, padding=3 241 | ) 242 | resblock = modules.ResBlock1 if resblock == "1" else modules.ResBlock2 243 | 244 | self.ups = nn.ModuleList() 245 | for i, (u, k) in enumerate(zip(upsample_rates, upsample_kernel_sizes)): 246 | self.ups.append( 247 | weight_norm( 248 | ConvTranspose1d( 249 | upsample_initial_channel // (2**i), 250 | upsample_initial_channel // (2 ** (i + 1)), 251 | k, 252 | u, 253 | padding=(k - u) // 2, 254 | ) 255 | ) 256 | ) 257 | 258 | self.resblocks = nn.ModuleList() 259 | for i in range(len(self.ups)): 260 | ch = upsample_initial_channel // (2 ** (i + 1)) 261 | for j, (k, d) in enumerate( 262 | zip(resblock_kernel_sizes, resblock_dilation_sizes) 263 | ): 264 | self.resblocks.append(resblock(ch, k, d)) 265 | 266 | self.conv_post = Conv1d(ch, 1, 7, 1, padding=3, bias=False) 267 | self.ups.apply(init_weights) 268 | 269 | if gin_channels != 0: 270 | self.cond = nn.Conv1d(gin_channels, upsample_initial_channel, 1) 271 | 272 | def forward(self, x, g=None): 273 | x = self.conv_pre(x) 274 | if g is not None: 275 | x = x + self.cond(g) 276 | 277 | for i in range(self.num_upsamples): 278 | x = F.leaky_relu(x, modules.LRELU_SLOPE) 279 | x = self.ups[i](x) 280 | xs = None 281 | for j in range(self.num_kernels): 282 | if xs is None: 283 | xs = self.resblocks[i * self.num_kernels + j](x) 284 | else: 285 | xs += self.resblocks[i * self.num_kernels + j](x) 286 | x = xs / self.num_kernels 287 | x = F.leaky_relu(x) 288 | x = self.conv_post(x) 289 | x = torch.tanh(x) 290 | 291 | return x 292 | 293 | def remove_weight_norm(self): 294 | print("Removing weight norm...") 295 | for layer in self.ups: 296 | remove_weight_norm(layer) 297 | for layer in self.resblocks: 298 | layer.remove_weight_norm() 299 | 300 | 301 | class ReferenceEncoder(nn.Module): 302 | """ 303 | inputs --- [N, Ty/r, n_mels*r] mels 304 | outputs --- [N, ref_enc_gru_size] 305 | """ 306 | 307 | def __init__(self, spec_channels, gin_channels=0, layernorm=True): 308 | super().__init__() 309 | self.spec_channels = spec_channels 310 | ref_enc_filters = [32, 32, 64, 64, 128, 128] 311 | K = len(ref_enc_filters) 312 | filters = [1] + ref_enc_filters 313 | convs = [ 314 | weight_norm( 315 | nn.Conv2d( 316 | in_channels=filters[i], 317 | out_channels=filters[i + 1], 318 | 
kernel_size=(3, 3), 319 | stride=(2, 2), 320 | padding=(1, 1), 321 | ) 322 | ) 323 | for i in range(K) 324 | ] 325 | self.convs = nn.ModuleList(convs) 326 | 327 | out_channels = self.calculate_channels(spec_channels, 3, 2, 1, K) 328 | self.gru = nn.GRU( 329 | input_size=ref_enc_filters[-1] * out_channels, 330 | hidden_size=256 // 2, 331 | batch_first=True, 332 | ) 333 | self.proj = nn.Linear(128, gin_channels) 334 | if layernorm: 335 | self.layernorm = nn.LayerNorm(self.spec_channels) 336 | else: 337 | self.layernorm = None 338 | 339 | def forward(self, inputs, mask=None): 340 | N = inputs.size(0) 341 | 342 | out = inputs.view(N, 1, -1, self.spec_channels) # [N, 1, Ty, n_freqs] 343 | if self.layernorm is not None: 344 | out = self.layernorm(out) 345 | 346 | for conv in self.convs: 347 | out = conv(out) 348 | # out = wn(out) 349 | out = F.relu(out) # [N, 128, Ty//2^K, n_mels//2^K] 350 | 351 | out = out.transpose(1, 2) # [N, Ty//2^K, 128, n_mels//2^K] 352 | T = out.size(1) 353 | N = out.size(0) 354 | out = out.contiguous().view(N, T, -1) # [N, Ty//2^K, 128*n_mels//2^K] 355 | 356 | self.gru.flatten_parameters() 357 | memory, out = self.gru(out) # out --- [1, N, 128] 358 | 359 | return self.proj(out.squeeze(0)) 360 | 361 | def calculate_channels(self, L, kernel_size, stride, pad, n_convs): 362 | for i in range(n_convs): 363 | L = (L - kernel_size + 2 * pad) // stride + 1 364 | return L 365 | 366 | 367 | class ResidualCouplingBlock(nn.Module): 368 | def __init__(self, 369 | channels, 370 | hidden_channels, 371 | kernel_size, 372 | dilation_rate, 373 | n_layers, 374 | n_flows=4, 375 | gin_channels=0): 376 | super().__init__() 377 | self.channels = channels 378 | self.hidden_channels = hidden_channels 379 | self.kernel_size = kernel_size 380 | self.dilation_rate = dilation_rate 381 | self.n_layers = n_layers 382 | self.n_flows = n_flows 383 | self.gin_channels = gin_channels 384 | 385 | self.flows = nn.ModuleList() 386 | for i in range(n_flows): 387 | self.flows.append(modules.ResidualCouplingLayer(channels, hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=gin_channels, mean_only=True)) 388 | self.flows.append(modules.Flip()) 389 | 390 | def forward(self, x, x_mask, g=None, reverse=False): 391 | if not reverse: 392 | for flow in self.flows: 393 | x, _ = flow(x, x_mask, g=g, reverse=reverse) 394 | else: 395 | for flow in reversed(self.flows): 396 | x = flow(x, x_mask, g=g, reverse=reverse) 397 | return x 398 | 399 | class SynthesizerTrn(nn.Module): 400 | """ 401 | Synthesizer for Training 402 | """ 403 | 404 | def __init__( 405 | self, 406 | n_vocab, 407 | spec_channels, 408 | inter_channels, 409 | hidden_channels, 410 | filter_channels, 411 | n_heads, 412 | n_layers, 413 | kernel_size, 414 | p_dropout, 415 | resblock, 416 | resblock_kernel_sizes, 417 | resblock_dilation_sizes, 418 | upsample_rates, 419 | upsample_initial_channel, 420 | upsample_kernel_sizes, 421 | n_speakers=256, 422 | gin_channels=256, 423 | zero_g=False, 424 | **kwargs 425 | ): 426 | super().__init__() 427 | 428 | self.dec = Generator( 429 | inter_channels, 430 | resblock, 431 | resblock_kernel_sizes, 432 | resblock_dilation_sizes, 433 | upsample_rates, 434 | upsample_initial_channel, 435 | upsample_kernel_sizes, 436 | gin_channels=gin_channels, 437 | ) 438 | self.enc_q = PosteriorEncoder( 439 | spec_channels, 440 | inter_channels, 441 | hidden_channels, 442 | 5, 443 | 1, 444 | 16, 445 | gin_channels=gin_channels, 446 | ) 447 | 448 | self.flow = ResidualCouplingBlock(inter_channels, hidden_channels, 5, 1, 4, 
gin_channels=gin_channels) 449 | 450 | self.n_speakers = n_speakers 451 | if n_speakers == 0: 452 | self.ref_enc = ReferenceEncoder(spec_channels, gin_channels) 453 | else: 454 | self.enc_p = TextEncoder(n_vocab, 455 | inter_channels, 456 | hidden_channels, 457 | filter_channels, 458 | n_heads, 459 | n_layers, 460 | kernel_size, 461 | p_dropout) 462 | self.sdp = StochasticDurationPredictor(hidden_channels, 192, 3, 0.5, 4, gin_channels=gin_channels) 463 | self.dp = DurationPredictor(hidden_channels, 256, 3, 0.5, gin_channels=gin_channels) 464 | self.emb_g = nn.Embedding(n_speakers, gin_channels) 465 | self.zero_g = zero_g 466 | 467 | def infer(self, x, x_lengths, sid=None, noise_scale=1, length_scale=1, noise_scale_w=1., sdp_ratio=0.2, max_len=None): 468 | x, m_p, logs_p, x_mask = self.enc_p(x, x_lengths) 469 | if self.n_speakers > 0: 470 | g = self.emb_g(sid).unsqueeze(-1) # [b, h, 1] 471 | else: 472 | g = None 473 | 474 | logw = self.sdp(x, x_mask, g=g, reverse=True, noise_scale=noise_scale_w) * sdp_ratio \ 475 | + self.dp(x, x_mask, g=g) * (1 - sdp_ratio) 476 | 477 | w = torch.exp(logw) * x_mask * length_scale 478 | w_ceil = torch.ceil(w) 479 | y_lengths = torch.clamp_min(torch.sum(w_ceil, [1, 2]), 1).long() 480 | y_mask = torch.unsqueeze(commons.sequence_mask(y_lengths, None), 1).to(x_mask.dtype) 481 | attn_mask = torch.unsqueeze(x_mask, 2) * torch.unsqueeze(y_mask, -1) 482 | attn = commons.generate_path(w_ceil, attn_mask) 483 | 484 | m_p = torch.matmul(attn.squeeze(1), m_p.transpose(1, 2)).transpose(1, 2) # [b, t', t], [b, t, d] -> [b, d, t'] 485 | logs_p = torch.matmul(attn.squeeze(1), logs_p.transpose(1, 2)).transpose(1, 2) # [b, t', t], [b, t, d] -> [b, d, t'] 486 | 487 | z_p = m_p + torch.randn_like(m_p) * torch.exp(logs_p) * noise_scale 488 | z = self.flow(z_p, y_mask, g=g, reverse=True) 489 | o = self.dec((z * y_mask)[:,:,:max_len], g=g) 490 | return o, attn, y_mask, (z, z_p, m_p, logs_p) 491 | 492 | def voice_conversion(self, y, y_lengths, sid_src, sid_tgt, tau=1.0): 493 | g_src = sid_src 494 | g_tgt = sid_tgt 495 | z, m_q, logs_q, y_mask = self.enc_q(y, y_lengths, g=g_src if not self.zero_g else torch.zeros_like(g_src), tau=tau) 496 | z_p = self.flow(z, y_mask, g=g_src) 497 | z_hat = self.flow(z_p, y_mask, g=g_tgt, reverse=True) 498 | o_hat = self.dec(z_hat * y_mask, g=g_tgt if not self.zero_g else torch.zeros_like(g_tgt)) 499 | return o_hat, y_mask, (z, z_p, z_hat) 500 | -------------------------------------------------------------------------------- /openvoice/modules.py: -------------------------------------------------------------------------------- 1 | import math 2 | import torch 3 | from torch import nn 4 | from torch.nn import functional as F 5 | 6 | from torch.nn import Conv1d 7 | from torch.nn.utils import weight_norm, remove_weight_norm 8 | 9 | from openvoice import commons 10 | from openvoice.commons import init_weights, get_padding 11 | from openvoice.transforms import piecewise_rational_quadratic_transform 12 | from openvoice.attentions import Encoder 13 | 14 | LRELU_SLOPE = 0.1 15 | 16 | 17 | class LayerNorm(nn.Module): 18 | def __init__(self, channels, eps=1e-5): 19 | super().__init__() 20 | self.channels = channels 21 | self.eps = eps 22 | 23 | self.gamma = nn.Parameter(torch.ones(channels)) 24 | self.beta = nn.Parameter(torch.zeros(channels)) 25 | 26 | def forward(self, x): 27 | x = x.transpose(1, -1) 28 | x = F.layer_norm(x, (self.channels,), self.gamma, self.beta, self.eps) 29 | return x.transpose(1, -1) 30 | 31 | 32 | class ConvReluNorm(nn.Module): 33 | def 
__init__( 34 | self, 35 | in_channels, 36 | hidden_channels, 37 | out_channels, 38 | kernel_size, 39 | n_layers, 40 | p_dropout, 41 | ): 42 | super().__init__() 43 | self.in_channels = in_channels 44 | self.hidden_channels = hidden_channels 45 | self.out_channels = out_channels 46 | self.kernel_size = kernel_size 47 | self.n_layers = n_layers 48 | self.p_dropout = p_dropout 49 | assert n_layers > 1, "Number of layers should be larger than 0." 50 | 51 | self.conv_layers = nn.ModuleList() 52 | self.norm_layers = nn.ModuleList() 53 | self.conv_layers.append( 54 | nn.Conv1d( 55 | in_channels, hidden_channels, kernel_size, padding=kernel_size // 2 56 | ) 57 | ) 58 | self.norm_layers.append(LayerNorm(hidden_channels)) 59 | self.relu_drop = nn.Sequential(nn.ReLU(), nn.Dropout(p_dropout)) 60 | for _ in range(n_layers - 1): 61 | self.conv_layers.append( 62 | nn.Conv1d( 63 | hidden_channels, 64 | hidden_channels, 65 | kernel_size, 66 | padding=kernel_size // 2, 67 | ) 68 | ) 69 | self.norm_layers.append(LayerNorm(hidden_channels)) 70 | self.proj = nn.Conv1d(hidden_channels, out_channels, 1) 71 | self.proj.weight.data.zero_() 72 | self.proj.bias.data.zero_() 73 | 74 | def forward(self, x, x_mask): 75 | x_org = x 76 | for i in range(self.n_layers): 77 | x = self.conv_layers[i](x * x_mask) 78 | x = self.norm_layers[i](x) 79 | x = self.relu_drop(x) 80 | x = x_org + self.proj(x) 81 | return x * x_mask 82 | 83 | 84 | class DDSConv(nn.Module): 85 | """ 86 | Dilated and Depth-Separable Convolution 87 | """ 88 | 89 | def __init__(self, channels, kernel_size, n_layers, p_dropout=0.0): 90 | super().__init__() 91 | self.channels = channels 92 | self.kernel_size = kernel_size 93 | self.n_layers = n_layers 94 | self.p_dropout = p_dropout 95 | 96 | self.drop = nn.Dropout(p_dropout) 97 | self.convs_sep = nn.ModuleList() 98 | self.convs_1x1 = nn.ModuleList() 99 | self.norms_1 = nn.ModuleList() 100 | self.norms_2 = nn.ModuleList() 101 | for i in range(n_layers): 102 | dilation = kernel_size**i 103 | padding = (kernel_size * dilation - dilation) // 2 104 | self.convs_sep.append( 105 | nn.Conv1d( 106 | channels, 107 | channels, 108 | kernel_size, 109 | groups=channels, 110 | dilation=dilation, 111 | padding=padding, 112 | ) 113 | ) 114 | self.convs_1x1.append(nn.Conv1d(channels, channels, 1)) 115 | self.norms_1.append(LayerNorm(channels)) 116 | self.norms_2.append(LayerNorm(channels)) 117 | 118 | def forward(self, x, x_mask, g=None): 119 | if g is not None: 120 | x = x + g 121 | for i in range(self.n_layers): 122 | y = self.convs_sep[i](x * x_mask) 123 | y = self.norms_1[i](y) 124 | y = F.gelu(y) 125 | y = self.convs_1x1[i](y) 126 | y = self.norms_2[i](y) 127 | y = F.gelu(y) 128 | y = self.drop(y) 129 | x = x + y 130 | return x * x_mask 131 | 132 | 133 | class WN(torch.nn.Module): 134 | def __init__( 135 | self, 136 | hidden_channels, 137 | kernel_size, 138 | dilation_rate, 139 | n_layers, 140 | gin_channels=0, 141 | p_dropout=0, 142 | ): 143 | super(WN, self).__init__() 144 | assert kernel_size % 2 == 1 145 | self.hidden_channels = hidden_channels 146 | self.kernel_size = (kernel_size,) 147 | self.dilation_rate = dilation_rate 148 | self.n_layers = n_layers 149 | self.gin_channels = gin_channels 150 | self.p_dropout = p_dropout 151 | 152 | self.in_layers = torch.nn.ModuleList() 153 | self.res_skip_layers = torch.nn.ModuleList() 154 | self.drop = nn.Dropout(p_dropout) 155 | 156 | if gin_channels != 0: 157 | cond_layer = torch.nn.Conv1d( 158 | gin_channels, 2 * hidden_channels * n_layers, 1 159 | ) 160 | self.cond_layer 
= torch.nn.utils.weight_norm(cond_layer, name="weight") 161 | 162 | for i in range(n_layers): 163 | dilation = dilation_rate**i 164 | padding = int((kernel_size * dilation - dilation) / 2) 165 | in_layer = torch.nn.Conv1d( 166 | hidden_channels, 167 | 2 * hidden_channels, 168 | kernel_size, 169 | dilation=dilation, 170 | padding=padding, 171 | ) 172 | in_layer = torch.nn.utils.weight_norm(in_layer, name="weight") 173 | self.in_layers.append(in_layer) 174 | 175 | # last one is not necessary 176 | if i < n_layers - 1: 177 | res_skip_channels = 2 * hidden_channels 178 | else: 179 | res_skip_channels = hidden_channels 180 | 181 | res_skip_layer = torch.nn.Conv1d(hidden_channels, res_skip_channels, 1) 182 | res_skip_layer = torch.nn.utils.weight_norm(res_skip_layer, name="weight") 183 | self.res_skip_layers.append(res_skip_layer) 184 | 185 | def forward(self, x, x_mask, g=None, **kwargs): 186 | output = torch.zeros_like(x) 187 | n_channels_tensor = torch.IntTensor([self.hidden_channels]) 188 | 189 | if g is not None: 190 | g = self.cond_layer(g) 191 | 192 | for i in range(self.n_layers): 193 | x_in = self.in_layers[i](x) 194 | if g is not None: 195 | cond_offset = i * 2 * self.hidden_channels 196 | g_l = g[:, cond_offset : cond_offset + 2 * self.hidden_channels, :] 197 | else: 198 | g_l = torch.zeros_like(x_in) 199 | 200 | acts = commons.fused_add_tanh_sigmoid_multiply(x_in, g_l, n_channels_tensor) 201 | acts = self.drop(acts) 202 | 203 | res_skip_acts = self.res_skip_layers[i](acts) 204 | if i < self.n_layers - 1: 205 | res_acts = res_skip_acts[:, : self.hidden_channels, :] 206 | x = (x + res_acts) * x_mask 207 | output = output + res_skip_acts[:, self.hidden_channels :, :] 208 | else: 209 | output = output + res_skip_acts 210 | return output * x_mask 211 | 212 | def remove_weight_norm(self): 213 | if self.gin_channels != 0: 214 | torch.nn.utils.remove_weight_norm(self.cond_layer) 215 | for l in self.in_layers: 216 | torch.nn.utils.remove_weight_norm(l) 217 | for l in self.res_skip_layers: 218 | torch.nn.utils.remove_weight_norm(l) 219 | 220 | 221 | class ResBlock1(torch.nn.Module): 222 | def __init__(self, channels, kernel_size=3, dilation=(1, 3, 5)): 223 | super(ResBlock1, self).__init__() 224 | self.convs1 = nn.ModuleList( 225 | [ 226 | weight_norm( 227 | Conv1d( 228 | channels, 229 | channels, 230 | kernel_size, 231 | 1, 232 | dilation=dilation[0], 233 | padding=get_padding(kernel_size, dilation[0]), 234 | ) 235 | ), 236 | weight_norm( 237 | Conv1d( 238 | channels, 239 | channels, 240 | kernel_size, 241 | 1, 242 | dilation=dilation[1], 243 | padding=get_padding(kernel_size, dilation[1]), 244 | ) 245 | ), 246 | weight_norm( 247 | Conv1d( 248 | channels, 249 | channels, 250 | kernel_size, 251 | 1, 252 | dilation=dilation[2], 253 | padding=get_padding(kernel_size, dilation[2]), 254 | ) 255 | ), 256 | ] 257 | ) 258 | self.convs1.apply(init_weights) 259 | 260 | self.convs2 = nn.ModuleList( 261 | [ 262 | weight_norm( 263 | Conv1d( 264 | channels, 265 | channels, 266 | kernel_size, 267 | 1, 268 | dilation=1, 269 | padding=get_padding(kernel_size, 1), 270 | ) 271 | ), 272 | weight_norm( 273 | Conv1d( 274 | channels, 275 | channels, 276 | kernel_size, 277 | 1, 278 | dilation=1, 279 | padding=get_padding(kernel_size, 1), 280 | ) 281 | ), 282 | weight_norm( 283 | Conv1d( 284 | channels, 285 | channels, 286 | kernel_size, 287 | 1, 288 | dilation=1, 289 | padding=get_padding(kernel_size, 1), 290 | ) 291 | ), 292 | ] 293 | ) 294 | self.convs2.apply(init_weights) 295 | 296 | def forward(self, x, 
x_mask=None): 297 | for c1, c2 in zip(self.convs1, self.convs2): 298 | xt = F.leaky_relu(x, LRELU_SLOPE) 299 | if x_mask is not None: 300 | xt = xt * x_mask 301 | xt = c1(xt) 302 | xt = F.leaky_relu(xt, LRELU_SLOPE) 303 | if x_mask is not None: 304 | xt = xt * x_mask 305 | xt = c2(xt) 306 | x = xt + x 307 | if x_mask is not None: 308 | x = x * x_mask 309 | return x 310 | 311 | def remove_weight_norm(self): 312 | for l in self.convs1: 313 | remove_weight_norm(l) 314 | for l in self.convs2: 315 | remove_weight_norm(l) 316 | 317 | 318 | class ResBlock2(torch.nn.Module): 319 | def __init__(self, channels, kernel_size=3, dilation=(1, 3)): 320 | super(ResBlock2, self).__init__() 321 | self.convs = nn.ModuleList( 322 | [ 323 | weight_norm( 324 | Conv1d( 325 | channels, 326 | channels, 327 | kernel_size, 328 | 1, 329 | dilation=dilation[0], 330 | padding=get_padding(kernel_size, dilation[0]), 331 | ) 332 | ), 333 | weight_norm( 334 | Conv1d( 335 | channels, 336 | channels, 337 | kernel_size, 338 | 1, 339 | dilation=dilation[1], 340 | padding=get_padding(kernel_size, dilation[1]), 341 | ) 342 | ), 343 | ] 344 | ) 345 | self.convs.apply(init_weights) 346 | 347 | def forward(self, x, x_mask=None): 348 | for c in self.convs: 349 | xt = F.leaky_relu(x, LRELU_SLOPE) 350 | if x_mask is not None: 351 | xt = xt * x_mask 352 | xt = c(xt) 353 | x = xt + x 354 | if x_mask is not None: 355 | x = x * x_mask 356 | return x 357 | 358 | def remove_weight_norm(self): 359 | for l in self.convs: 360 | remove_weight_norm(l) 361 | 362 | 363 | class Log(nn.Module): 364 | def forward(self, x, x_mask, reverse=False, **kwargs): 365 | if not reverse: 366 | y = torch.log(torch.clamp_min(x, 1e-5)) * x_mask 367 | logdet = torch.sum(-y, [1, 2]) 368 | return y, logdet 369 | else: 370 | x = torch.exp(x) * x_mask 371 | return x 372 | 373 | 374 | class Flip(nn.Module): 375 | def forward(self, x, *args, reverse=False, **kwargs): 376 | x = torch.flip(x, [1]) 377 | if not reverse: 378 | logdet = torch.zeros(x.size(0)).to(dtype=x.dtype, device=x.device) 379 | return x, logdet 380 | else: 381 | return x 382 | 383 | 384 | class ElementwiseAffine(nn.Module): 385 | def __init__(self, channels): 386 | super().__init__() 387 | self.channels = channels 388 | self.m = nn.Parameter(torch.zeros(channels, 1)) 389 | self.logs = nn.Parameter(torch.zeros(channels, 1)) 390 | 391 | def forward(self, x, x_mask, reverse=False, **kwargs): 392 | if not reverse: 393 | y = self.m + torch.exp(self.logs) * x 394 | y = y * x_mask 395 | logdet = torch.sum(self.logs * x_mask, [1, 2]) 396 | return y, logdet 397 | else: 398 | x = (x - self.m) * torch.exp(-self.logs) * x_mask 399 | return x 400 | 401 | 402 | class ResidualCouplingLayer(nn.Module): 403 | def __init__( 404 | self, 405 | channels, 406 | hidden_channels, 407 | kernel_size, 408 | dilation_rate, 409 | n_layers, 410 | p_dropout=0, 411 | gin_channels=0, 412 | mean_only=False, 413 | ): 414 | assert channels % 2 == 0, "channels should be divisible by 2" 415 | super().__init__() 416 | self.channels = channels 417 | self.hidden_channels = hidden_channels 418 | self.kernel_size = kernel_size 419 | self.dilation_rate = dilation_rate 420 | self.n_layers = n_layers 421 | self.half_channels = channels // 2 422 | self.mean_only = mean_only 423 | 424 | self.pre = nn.Conv1d(self.half_channels, hidden_channels, 1) 425 | self.enc = WN( 426 | hidden_channels, 427 | kernel_size, 428 | dilation_rate, 429 | n_layers, 430 | p_dropout=p_dropout, 431 | gin_channels=gin_channels, 432 | ) 433 | self.post = 
nn.Conv1d(hidden_channels, self.half_channels * (2 - mean_only), 1) 434 | self.post.weight.data.zero_() 435 | self.post.bias.data.zero_() 436 | 437 | def forward(self, x, x_mask, g=None, reverse=False): 438 | x0, x1 = torch.split(x, [self.half_channels] * 2, 1) 439 | h = self.pre(x0) * x_mask 440 | h = self.enc(h, x_mask, g=g) 441 | stats = self.post(h) * x_mask 442 | if not self.mean_only: 443 | m, logs = torch.split(stats, [self.half_channels] * 2, 1) 444 | else: 445 | m = stats 446 | logs = torch.zeros_like(m) 447 | 448 | if not reverse: 449 | x1 = m + x1 * torch.exp(logs) * x_mask 450 | x = torch.cat([x0, x1], 1) 451 | logdet = torch.sum(logs, [1, 2]) 452 | return x, logdet 453 | else: 454 | x1 = (x1 - m) * torch.exp(-logs) * x_mask 455 | x = torch.cat([x0, x1], 1) 456 | return x 457 | 458 | 459 | class ConvFlow(nn.Module): 460 | def __init__( 461 | self, 462 | in_channels, 463 | filter_channels, 464 | kernel_size, 465 | n_layers, 466 | num_bins=10, 467 | tail_bound=5.0, 468 | ): 469 | super().__init__() 470 | self.in_channels = in_channels 471 | self.filter_channels = filter_channels 472 | self.kernel_size = kernel_size 473 | self.n_layers = n_layers 474 | self.num_bins = num_bins 475 | self.tail_bound = tail_bound 476 | self.half_channels = in_channels // 2 477 | 478 | self.pre = nn.Conv1d(self.half_channels, filter_channels, 1) 479 | self.convs = DDSConv(filter_channels, kernel_size, n_layers, p_dropout=0.0) 480 | self.proj = nn.Conv1d( 481 | filter_channels, self.half_channels * (num_bins * 3 - 1), 1 482 | ) 483 | self.proj.weight.data.zero_() 484 | self.proj.bias.data.zero_() 485 | 486 | def forward(self, x, x_mask, g=None, reverse=False): 487 | x0, x1 = torch.split(x, [self.half_channels] * 2, 1) 488 | h = self.pre(x0) 489 | h = self.convs(h, x_mask, g=g) 490 | h = self.proj(h) * x_mask 491 | 492 | b, c, t = x0.shape 493 | h = h.reshape(b, c, -1, t).permute(0, 1, 3, 2) # [b, cx?, t] -> [b, c, t, ?] 
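`ResidualCouplingLayer`, defined just above, is an affine coupling flow: the first half of the channels conditions an (optionally mean-only) transform of the second half, so the mapping is exactly invertible. A small round-trip check with arbitrary sizes:

```python
import torch
from openvoice.modules import ResidualCouplingLayer

layer = ResidualCouplingLayer(channels=192, hidden_channels=192, kernel_size=5,
                              dilation_rate=1, n_layers=4, mean_only=True)
x = torch.randn(1, 192, 30)
x_mask = torch.ones(1, 1, 30)

y, logdet = layer(x, x_mask)                # forward direction also returns the log-determinant
x_rec = layer(y, x_mask, reverse=True)      # inverse direction returns only the tensor
print(torch.allclose(x, x_rec, atol=1e-5))  # True
```

Because `self.post` is zero-initialised, an untrained layer starts out as the identity map; training moves it away from identity while keeping it invertible.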
494 | 495 | unnormalized_widths = h[..., : self.num_bins] / math.sqrt(self.filter_channels) 496 | unnormalized_heights = h[..., self.num_bins : 2 * self.num_bins] / math.sqrt( 497 | self.filter_channels 498 | ) 499 | unnormalized_derivatives = h[..., 2 * self.num_bins :] 500 | 501 | x1, logabsdet = piecewise_rational_quadratic_transform( 502 | x1, 503 | unnormalized_widths, 504 | unnormalized_heights, 505 | unnormalized_derivatives, 506 | inverse=reverse, 507 | tails="linear", 508 | tail_bound=self.tail_bound, 509 | ) 510 | 511 | x = torch.cat([x0, x1], 1) * x_mask 512 | logdet = torch.sum(logabsdet * x_mask, [1, 2]) 513 | if not reverse: 514 | return x, logdet 515 | else: 516 | return x 517 | 518 | 519 | class TransformerCouplingLayer(nn.Module): 520 | def __init__( 521 | self, 522 | channels, 523 | hidden_channels, 524 | kernel_size, 525 | n_layers, 526 | n_heads, 527 | p_dropout=0, 528 | filter_channels=0, 529 | mean_only=False, 530 | wn_sharing_parameter=None, 531 | gin_channels=0, 532 | ): 533 | assert n_layers == 3, n_layers 534 | assert channels % 2 == 0, "channels should be divisible by 2" 535 | super().__init__() 536 | self.channels = channels 537 | self.hidden_channels = hidden_channels 538 | self.kernel_size = kernel_size 539 | self.n_layers = n_layers 540 | self.half_channels = channels // 2 541 | self.mean_only = mean_only 542 | 543 | self.pre = nn.Conv1d(self.half_channels, hidden_channels, 1) 544 | self.enc = ( 545 | Encoder( 546 | hidden_channels, 547 | filter_channels, 548 | n_heads, 549 | n_layers, 550 | kernel_size, 551 | p_dropout, 552 | isflow=True, 553 | gin_channels=gin_channels, 554 | ) 555 | if wn_sharing_parameter is None 556 | else wn_sharing_parameter 557 | ) 558 | self.post = nn.Conv1d(hidden_channels, self.half_channels * (2 - mean_only), 1) 559 | self.post.weight.data.zero_() 560 | self.post.bias.data.zero_() 561 | 562 | def forward(self, x, x_mask, g=None, reverse=False): 563 | x0, x1 = torch.split(x, [self.half_channels] * 2, 1) 564 | h = self.pre(x0) * x_mask 565 | h = self.enc(h, x_mask, g=g) 566 | stats = self.post(h) * x_mask 567 | if not self.mean_only: 568 | m, logs = torch.split(stats, [self.half_channels] * 2, 1) 569 | else: 570 | m = stats 571 | logs = torch.zeros_like(m) 572 | 573 | if not reverse: 574 | x1 = m + x1 * torch.exp(logs) * x_mask 575 | x = torch.cat([x0, x1], 1) 576 | logdet = torch.sum(logs, [1, 2]) 577 | return x, logdet 578 | else: 579 | x1 = (x1 - m) * torch.exp(-logs) * x_mask 580 | x = torch.cat([x0, x1], 1) 581 | return x 582 | 583 | x1, logabsdet = piecewise_rational_quadratic_transform( 584 | x1, 585 | unnormalized_widths, 586 | unnormalized_heights, 587 | unnormalized_derivatives, 588 | inverse=reverse, 589 | tails="linear", 590 | tail_bound=self.tail_bound, 591 | ) 592 | 593 | x = torch.cat([x0, x1], 1) * x_mask 594 | logdet = torch.sum(logabsdet * x_mask, [1, 2]) 595 | if not reverse: 596 | return x, logdet 597 | else: 598 | return x 599 | -------------------------------------------------------------------------------- /openvoice/openvoice_app.py: -------------------------------------------------------------------------------- 1 | import os 2 | import torch 3 | import argparse 4 | import gradio as gr 5 | from zipfile import ZipFile 6 | import langid 7 | from openvoice import se_extractor 8 | from openvoice.api import BaseSpeakerTTS, ToneColorConverter 9 | 10 | parser = argparse.ArgumentParser() 11 | parser.add_argument("--share", action='store_true', default=False, help="make link public") 12 | args = parser.parse_args() 
13 | 14 | en_ckpt_base = 'checkpoints/base_speakers/EN' 15 | zh_ckpt_base = 'checkpoints/base_speakers/ZH' 16 | ckpt_converter = 'checkpoints/converter' 17 | device = 'cuda' if torch.cuda.is_available() else 'cpu' 18 | output_dir = 'outputs' 19 | os.makedirs(output_dir, exist_ok=True) 20 | 21 | # load models 22 | en_base_speaker_tts = BaseSpeakerTTS(f'{en_ckpt_base}/config.json', device=device) 23 | en_base_speaker_tts.load_ckpt(f'{en_ckpt_base}/checkpoint.pth') 24 | zh_base_speaker_tts = BaseSpeakerTTS(f'{zh_ckpt_base}/config.json', device=device) 25 | zh_base_speaker_tts.load_ckpt(f'{zh_ckpt_base}/checkpoint.pth') 26 | tone_color_converter = ToneColorConverter(f'{ckpt_converter}/config.json', device=device) 27 | tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth') 28 | 29 | # load speaker embeddings 30 | en_source_default_se = torch.load(f'{en_ckpt_base}/en_default_se.pth').to(device) 31 | en_source_style_se = torch.load(f'{en_ckpt_base}/en_style_se.pth').to(device) 32 | zh_source_se = torch.load(f'{zh_ckpt_base}/zh_default_se.pth').to(device) 33 | 34 | # This online demo mainly supports English and Chinese 35 | supported_languages = ['zh', 'en'] 36 | 37 | def predict(prompt, style, audio_file_pth, agree): 38 | # initialize a empty info 39 | text_hint = '' 40 | # agree with the terms 41 | if agree == False: 42 | text_hint += '[ERROR] Please accept the Terms & Condition!\n' 43 | gr.Warning("Please accept the Terms & Condition!") 44 | return ( 45 | text_hint, 46 | None, 47 | None, 48 | ) 49 | 50 | # first detect the input language 51 | language_predicted = langid.classify(prompt)[0].strip() 52 | print(f"Detected language:{language_predicted}") 53 | 54 | if language_predicted not in supported_languages: 55 | text_hint += f"[ERROR] The detected language {language_predicted} for your input text is not in our Supported Languages: {supported_languages}\n" 56 | gr.Warning( 57 | f"The detected language {language_predicted} for your input text is not in our Supported Languages: {supported_languages}" 58 | ) 59 | 60 | return ( 61 | text_hint, 62 | None, 63 | None, 64 | ) 65 | 66 | if language_predicted == "zh": 67 | tts_model = zh_base_speaker_tts 68 | source_se = zh_source_se 69 | language = 'Chinese' 70 | if style not in ['default']: 71 | text_hint += f"[ERROR] The style {style} is not supported for Chinese, which should be in ['default']\n" 72 | gr.Warning(f"The style {style} is not supported for Chinese, which should be in ['default']") 73 | return ( 74 | text_hint, 75 | None, 76 | None, 77 | ) 78 | 79 | else: 80 | tts_model = en_base_speaker_tts 81 | if style == 'default': 82 | source_se = en_source_default_se 83 | else: 84 | source_se = en_source_style_se 85 | language = 'English' 86 | if style not in ['default', 'whispering', 'shouting', 'excited', 'cheerful', 'terrified', 'angry', 'sad', 'friendly']: 87 | text_hint += f"[ERROR] The style {style} is not supported for English, which should be in ['default', 'whispering', 'shouting', 'excited', 'cheerful', 'terrified', 'angry', 'sad', 'friendly']\n" 88 | gr.Warning(f"The style {style} is not supported for English, which should be in ['default', 'whispering', 'shouting', 'excited', 'cheerful', 'terrified', 'angry', 'sad', 'friendly']") 89 | return ( 90 | text_hint, 91 | None, 92 | None, 93 | ) 94 | 95 | speaker_wav = audio_file_pth 96 | 97 | if len(prompt) < 2: 98 | text_hint += f"[ERROR] Please give a longer prompt text \n" 99 | gr.Warning("Please give a longer prompt text") 100 | return ( 101 | text_hint, 102 | None, 103 | None, 
104 | ) 105 | if len(prompt) > 200: 106 | text_hint += f"[ERROR] Text length limited to 200 characters for this demo, please try shorter text. You can clone our open-source repo and try for your usage \n" 107 | gr.Warning( 108 | "Text length limited to 200 characters for this demo, please try shorter text. You can clone our open-source repo for your usage" 109 | ) 110 | return ( 111 | text_hint, 112 | None, 113 | None, 114 | ) 115 | 116 | # note diffusion_conditioning not used on hifigan (default mode), it will be empty but need to pass it to model.inference 117 | try: 118 | target_se, audio_name = se_extractor.get_se(speaker_wav, tone_color_converter, target_dir='processed', vad=True) 119 | except Exception as e: 120 | text_hint += f"[ERROR] Get target tone color error {str(e)} \n" 121 | gr.Warning( 122 | "[ERROR] Get target tone color error {str(e)} \n" 123 | ) 124 | return ( 125 | text_hint, 126 | None, 127 | None, 128 | ) 129 | 130 | src_path = f'{output_dir}/tmp.wav' 131 | tts_model.tts(prompt, src_path, speaker=style, language=language) 132 | 133 | save_path = f'{output_dir}/output.wav' 134 | # Run the tone color converter 135 | encode_message = "@MyShell" 136 | tone_color_converter.convert( 137 | audio_src_path=src_path, 138 | src_se=source_se, 139 | tgt_se=target_se, 140 | output_path=save_path, 141 | message=encode_message) 142 | 143 | text_hint += f'''Get response successfully \n''' 144 | 145 | return ( 146 | text_hint, 147 | save_path, 148 | speaker_wav, 149 | ) 150 | 151 | 152 | 153 | title = "MyShell OpenVoice" 154 | 155 | description = """ 156 | We introduce OpenVoice, a versatile instant voice cloning approach that requires only a short audio clip from the reference speaker to replicate their voice and generate speech in multiple languages. OpenVoice enables granular control over voice styles, including emotion, accent, rhythm, pauses, and intonation, in addition to replicating the tone color of the reference speaker. OpenVoice also achieves zero-shot cross-lingual voice cloning for languages not included in the massive-speaker training set. 157 | """ 158 | 159 | markdown_table = """ 160 |
161 | 162 | | | | | 163 | | :-----------: | :-----------: | :-----------: | 164 | | **OpenSource Repo** | **Project Page** | **Join the Community** | 165 | | [OpenVoice](https://github.com/myshell-ai/OpenVoice) | [OpenVoice](https://research.myshell.ai/open-voice) | [![Discord](https://img.shields.io/discord/1122227993805336617?color=%239B59B6&label=%20Discord%20)](https://discord.gg/myshell) | 166 | 167 |
168 | """ 169 | 170 | markdown_table_v2 = """ 171 |
172 | 173 | | | | | | 174 | | :-----------: | :-----------: | :-----------: | :-----------: | 175 | | **OpenSource Repo** | [OpenVoice](https://github.com/myshell-ai/OpenVoice) | **Project Page** | [OpenVoice](https://research.myshell.ai/open-voice) | 176 | 177 | | | | 178 | | :-----------: | :-----------: | 179 | **Join the Community** | [![Discord](https://img.shields.io/discord/1122227993805336617?color=%239B59B6&label=%20Discord%20)](https://discord.gg/myshell) | 180 | 181 |
182 | """ 183 | content = """ 184 |
185 | If the generated voice does not sound like the reference voice, please refer to <a href='https://github.com/myshell-ai/OpenVoice/blob/main/docs/QA.md'>this QnA</a>. For multi-lingual & cross-lingual examples, please refer to <a href='https://github.com/myshell-ai/OpenVoice/blob/main/demo_part2.ipynb'>this jupyter notebook</a>. 186 | This online demo mainly supports English. The default style also supports Chinese. But OpenVoice can adapt to any other language as long as a base speaker is provided. 187 |
188 | """ 189 | wrapped_markdown_content = f"<div>{content}</div>
" 190 | 191 | 192 | examples = [ 193 | [ 194 | "今天天气真好,我们一起出去吃饭吧。", 195 | 'default', 196 | "resources/demo_speaker1.mp3", 197 | True, 198 | ],[ 199 | "This audio is generated by open voice with a half-performance model.", 200 | 'whispering', 201 | "resources/demo_speaker2.mp3", 202 | True, 203 | ], 204 | [ 205 | "He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick, peppered, flour-fattened sauce.", 206 | 'sad', 207 | "resources/demo_speaker0.mp3", 208 | True, 209 | ], 210 | ] 211 | 212 | with gr.Blocks(analytics_enabled=False) as demo: 213 | 214 | with gr.Row(): 215 | with gr.Column(): 216 | with gr.Row(): 217 | gr.Markdown( 218 | """ 219 | ## 220 | """ 221 | ) 222 | with gr.Row(): 223 | gr.Markdown(markdown_table_v2) 224 | with gr.Row(): 225 | gr.Markdown(description) 226 | with gr.Column(): 227 | gr.Video('https://github.com/myshell-ai/OpenVoice/assets/40556743/3cba936f-82bf-476c-9e52-09f0f417bb2f', autoplay=True) 228 | 229 | with gr.Row(): 230 | gr.HTML(wrapped_markdown_content) 231 | 232 | with gr.Row(): 233 | with gr.Column(): 234 | input_text_gr = gr.Textbox( 235 | label="Text Prompt", 236 | info="One or two sentences at a time is better. Up to 200 text characters.", 237 | value="He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick, peppered, flour-fattened sauce.", 238 | ) 239 | style_gr = gr.Dropdown( 240 | label="Style", 241 | info="Select a style of output audio for the synthesised speech. (Chinese only support 'default' now)", 242 | choices=['default', 'whispering', 'cheerful', 'terrified', 'angry', 'sad', 'friendly'], 243 | max_choices=1, 244 | value="default", 245 | ) 246 | ref_gr = gr.Audio( 247 | label="Reference Audio", 248 | info="Click on the ✎ button to upload your own target speaker audio", 249 | type="filepath", 250 | value="resources/demo_speaker2.mp3", 251 | ) 252 | tos_gr = gr.Checkbox( 253 | label="Agree", 254 | value=False, 255 | info="I agree to the terms of the cc-by-nc-4.0 license-: https://github.com/myshell-ai/OpenVoice/blob/main/LICENSE", 256 | ) 257 | 258 | tts_button = gr.Button("Send", elem_id="send-btn", visible=True) 259 | 260 | 261 | with gr.Column(): 262 | out_text_gr = gr.Text(label="Info") 263 | audio_gr = gr.Audio(label="Synthesised Audio", autoplay=True) 264 | ref_audio_gr = gr.Audio(label="Reference Audio Used") 265 | 266 | gr.Examples(examples, 267 | label="Examples", 268 | inputs=[input_text_gr, style_gr, ref_gr, tos_gr], 269 | outputs=[out_text_gr, audio_gr, ref_audio_gr], 270 | fn=predict, 271 | cache_examples=False,) 272 | tts_button.click(predict, [input_text_gr, style_gr, ref_gr, tos_gr], outputs=[out_text_gr, audio_gr, ref_audio_gr]) 273 | 274 | demo.queue() 275 | demo.launch(debug=True, show_api=True, share=args.share) 276 | -------------------------------------------------------------------------------- /openvoice/openvoice_server.py: -------------------------------------------------------------------------------- 1 | import os 2 | import time 3 | import torch 4 | from melo.api import TTS 5 | import se_extractor 6 | import io 7 | import magic 8 | import logging 9 | 10 | from fastapi import FastAPI, UploadFile, File, HTTPException, Form 11 | from fastapi.middleware.cors import CORSMiddleware 12 | from starlette.responses import StreamingResponse 13 | from typing import Optional 14 | from pydantic import BaseModel 15 | from api import ToneColorConverter 16 | 17 | 
logging.basicConfig(level=logging.INFO) 18 | 19 | app = FastAPI() 20 | 21 | # Add CORS middleware 22 | app.add_middleware( 23 | CORSMiddleware, 24 | allow_origins=["*"], # Allows all origins 25 | allow_credentials=True, 26 | allow_methods=["*"], # Allows all methods 27 | allow_headers=["*"], # Allows all headers 28 | ) 29 | 30 | # New checkpoint paths 31 | ckpt_base = 'checkpoints_v2/base_speakers/ses' 32 | 33 | device = "cuda:0" if torch.cuda.is_available() else "cpu" 34 | 35 | tone_color_converter = ToneColorConverter('checkpoints_v2/converter/config.json', device=device) 36 | tone_color_converter.load_ckpt('checkpoints_v2/converter/checkpoint.pth') 37 | 38 | output_dir = 'outputs' 39 | os.makedirs(output_dir, exist_ok=True) 40 | 41 | # Available base speakers 42 | base_speakers = ['en-au', 'en-br', 'en-default', 'en-india', 'en-newest', 'en-us', 'es', 'fr', 'jp', 'kr', 'zh'] 43 | key_map = {'en-newest': ('EN-Newest', 'EN_NEWEST'), 44 | 'en-us': ('EN-US', 'EN'), 45 | 'en-br': ('EN-BR', 'EN'), 46 | 'en-india': ('EN_INDIA', 'EN'), 47 | 'en-au': ('EN-AU', 'EN'), 48 | 'en-default': ('EN-Default', 'EN'), 49 | 'es': ('ES', 'ES'), 50 | 'fr': ('FR', 'FR'), 51 | 'jp': ('JP', 'JP'), 52 | 'kr': ('KR', 'KR'), 53 | 'zh': ('ZH', 'ZH') 54 | } 55 | 56 | source_se = { 57 | "en-newest": torch.load(f'{ckpt_base}/en-newest.pth').to(device), 58 | "en-us": torch.load(f'{ckpt_base}/en-us.pth').to(device), 59 | "en-br": torch.load(f'{ckpt_base}/en-br.pth').to(device), 60 | "en-india": torch.load(f'{ckpt_base}/en-india.pth').to(device), 61 | "en-au": torch.load(f'{ckpt_base}/en-au.pth').to(device), 62 | "en-default": torch.load(f'{ckpt_base}/en-default.pth').to(device), 63 | "es": torch.load(f'{ckpt_base}/es.pth').to(device), 64 | "fr": torch.load(f'{ckpt_base}/fr.pth').to(device), 65 | "jp": torch.load(f'{ckpt_base}/jp.pth').to(device), 66 | "kr": torch.load(f'{ckpt_base}/kr.pth').to(device), 67 | "zh": torch.load(f'{ckpt_base}/zh.pth').to(device) 68 | } 69 | logging.info('Loaded base speakers.') 70 | logging.info('Loading TTS models...') 71 | model = {} 72 | 73 | # When running on CPU, only load the en-newest model 74 | if device == "cpu": 75 | base_speakers = ['en-newest'] 76 | 77 | for accent in base_speakers: 78 | logging.info(f'Loading {accent}...') 79 | model[accent] = TTS(language=key_map[accent][1], device=device) 80 | logging.info('...done.') 81 | 82 | logging.info('Loaded TTS models.') 83 | 84 | 85 | class UploadAudioRequest(BaseModel): 86 | audio_file_label: str 87 | 88 | 89 | @app.on_event("startup") 90 | async def startup_event(): 91 | test_text = "This is a test sentence generated by the OpenVoice API." 92 | voice = "demo_speaker0" 93 | await synthesize_speech(test_text, voice) 94 | 95 | 96 | @app.get("/base_tts/") 97 | async def base_tts(text: str, accent: Optional[str] = 'en-newest', speed: Optional[float] = 1.0): 98 | """ 99 | Perform text-to-speech conversion using only the base speaker. 100 | 101 | :param text: The text to be converted to speech. 102 | :type text: str 103 | :param accent: The accent to be used for the synthesized speech, defaults to 'en-newest'. 104 | :type accent: str, optional 105 | :param speed: The speed of the synthesized speech, defaults to 1.0. 106 | :type speed: float, optional 107 | :return: The speech audio. 
108 | :rtype: .wav file 109 | """ 110 | global model 111 | 112 | if accent not in model: 113 | logging.info(f'Loading {accent}...') 114 | model[accent] = TTS(language=key_map[accent][1], device=device) 115 | logging.info('...done.') 116 | 117 | try: 118 | save_path = f'{output_dir}/output_v2_{accent}.wav' 119 | model[accent].tts_to_file(text, model[accent].hps.data.spk2id[key_map[accent][0]], save_path, speed=speed) 120 | result = StreamingResponse(open(save_path, 'rb'), media_type="audio/wav") 121 | return result 122 | except Exception as e: 123 | raise HTTPException(status_code=500, detail=str(e)) 124 | 125 | 126 | @app.post("/change_voice/") 127 | async def change_voice(reference_speaker: str = Form(...), file: UploadFile = File(...), watermark: Optional[str] = "@MyShell"): 128 | """ 129 | Change the voice of an existing audio file. 130 | 131 | :param reference_speaker: The name of the reference speaker. 132 | :type reference_speaker: str 133 | :param file: The audio file to be changed. 134 | :type file: UploadFile 135 | :param watermark: The watermark to be encoded in the voice conversion, defaults to '@MyShell'. 136 | :type watermark: str, optional 137 | :return: The audio file with the changed voice. 138 | :rtype: .wav file 139 | """ 140 | try: 141 | logging.info(f'changing voice to {reference_speaker}...') 142 | 143 | if watermark: 144 | logging.info(f'watermark: {watermark}') 145 | 146 | contents = await file.read() 147 | temp_file = io.BytesIO(contents) 148 | matching_files = [file for file in os.listdir("resources") if file.startswith(str(reference_speaker))] 149 | if not matching_files: 150 | raise HTTPException(status_code=400, detail="No matching reference speaker found.") 151 | reference_speaker_file = f'resources/{matching_files[0]}' 152 | target_se, audio_name = se_extractor.get_se(reference_speaker_file, tone_color_converter, target_dir='processed', vad=True) 153 | save_path = f'{output_dir}/output_v2_{reference_speaker}.wav' 154 | tone_color_converter.convert( 155 | audio_src_path=temp_file, 156 | src_se=source_se['en-newest'], 157 | tgt_se=target_se, 158 | output_path=save_path, 159 | message=watermark) 160 | result = StreamingResponse(open(save_path, 'rb'), media_type="audio/wav") 161 | return result 162 | except Exception as e: 163 | raise HTTPException(status_code=500, detail=str(e)) 164 | 165 | 166 | @app.post("/upload_audio/") 167 | async def upload_audio(audio_file_label: str = Form(...), file: UploadFile = File(...)): 168 | """ 169 | Upload an audio file for later use as the reference audio. 170 | 171 | :param audio_file_label: The label for the audio file. 172 | :param file: The audio file to be uploaded. 173 | :type file: UploadFile 174 | :return: Confirmation of successful upload. 175 | :rtype: dict 176 | """ 177 | try: 178 | contents = await file.read() 179 | 180 | allowed_extensions = {'wav', 'mp3', 'flac', 'ogg'} 181 | max_file_size = 5 * 1024 * 1024 # 5MB 182 | 183 | if not file.filename.split('.')[-1] in allowed_extensions: 184 | return {"error": "Invalid file type. Allowed types are: wav, mp3, flac, ogg"} 185 | 186 | if len(contents) > max_file_size: 187 | return {"error": "File size is over limit. Max size is 5MB."} 188 | 189 | # Note: we need to first write the file in order to check magic. 
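# python-magic sniffs the MIME type directly from the in-memory bytes (e.g. 'audio/x-wav',
# 'audio/mpeg'); the upload is rejected below unless the detected type contains 'audio'.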
190 | temp_file = io.BytesIO(contents) 191 | file_format = magic.from_buffer(temp_file.read(), mime=True) 192 | 193 | if 'audio' not in file_format: 194 | return {"error": "Invalid file content."} 195 | 196 | # Make sure the resources directory exists 197 | os.makedirs("resources", exist_ok=True) 198 | 199 | # Use provided 'audio_file_label' for stored file's name. 200 | # We retain the file extension to ensure appropriate processing later. 201 | file_extension = file.filename.split('.')[-1] 202 | stored_file_name = f"{audio_file_label}.{file_extension}" 203 | 204 | with open(f"resources/{stored_file_name}", "wb") as f: 205 | f.write(contents) 206 | 207 | return {"message": f"File {file.filename} uploaded successfully with label {audio_file_label}."} 208 | except Exception as e: 209 | raise HTTPException(status_code=500, detail=str(e)) 210 | 211 | 212 | @app.get("/synthesize_speech/") 213 | async def synthesize_speech( 214 | text: str, 215 | voice: str, 216 | accent: Optional[str] = 'en-newest', 217 | speed: Optional[float] = 1.0, 218 | watermark: Optional[str] = "@MyShell" 219 | ): 220 | """ 221 | Synthesize speech from text using a specified voice and style. 222 | 223 | :param text: The text to be synthesized into speech. 224 | :type text: str 225 | :param voice: The voice to be used for the synthesized speech. 226 | :type voice: str 227 | :param accent: The accent to be used for the synthesized speech, defaults to 'en-newest'. 228 | :type accent: str, optional 229 | :param speed: The speed of the synthesized speech, defaults to 1.0. 230 | :type speed: float, optional 231 | :param watermark: The watermark to be encoded in the voice conversion, defaults to '@MyShell'. 232 | :type watermark: str, optional 233 | :return: The synthesized speech as a .wav file. 
234 | :rtype: .wav file 235 | """ 236 | global model 237 | 238 | if accent not in model: 239 | logging.info(f'Loading {accent}...') 240 | model[accent] = TTS(language=key_map[accent][1], device=device) 241 | logging.info('...done.') 242 | 243 | start_time = time.time() 244 | try: 245 | logging.info(f'Generating speech for {voice}') 246 | if watermark: 247 | logging.info(f'watermark: {watermark}') 248 | 249 | # Retrieve the correct file based on the 'voice' parameter 250 | # It should match the 'audio_file_label' used while uploading 251 | matching_files = [file for file in os.listdir("resources") if file.startswith(voice)] 252 | 253 | if not matching_files: 254 | raise HTTPException(status_code=400, detail="No matching voice found.") 255 | 256 | reference_speaker = f'resources/{matching_files[0]}' 257 | 258 | target_se, audio_name = se_extractor.get_se(reference_speaker, tone_color_converter, target_dir='processed', vad=True) 259 | 260 | # Run the base speaker tts 261 | src_path = f'{output_dir}/tmp.wav' 262 | save_path = f'{output_dir}/output_v2_{accent}.wav' 263 | model[accent].tts_to_file(text, model[accent].hps.data.spk2id[key_map[accent][0]], src_path, speed=speed) 264 | 265 | # Run the tone color converter 266 | tone_color_converter.convert( 267 | audio_src_path=src_path, 268 | src_se=source_se[accent], 269 | tgt_se=target_se, 270 | output_path=save_path, 271 | message=watermark) 272 | 273 | result = StreamingResponse(open(save_path, 'rb'), media_type="audio/wav") 274 | except Exception as e: 275 | raise HTTPException(status_code=500, detail=str(e)) 276 | 277 | end_time = time.time() 278 | elapsed_time = end_time - start_time 279 | 280 | result.headers["X-Elapsed-Time"] = str(elapsed_time) 281 | result.headers["X-Device-Used"] = device 282 | 283 | # Add CORS headers 284 | result.headers["Access-Control-Allow-Origin"] = "*" # Required for CORS support 285 | result.headers["Access-Control-Allow-Credentials"] = "true" # Required for cookies, authorization headers with HTTPS 286 | result.headers["Access-Control-Allow-Headers"] = "Origin, Content-Type, X-Amz-Date, Authorization, X-Api-Key, X-Amz-Security-Token, locale" 287 | result.headers["Access-Control-Allow-Methods"] = "POST, OPTIONS" 288 | 289 | return result 290 | -------------------------------------------------------------------------------- /openvoice/se_extractor.py: -------------------------------------------------------------------------------- 1 | import os 2 | import glob 3 | import torch 4 | import hashlib 5 | import librosa 6 | import base64 7 | from glob import glob 8 | import numpy as np 9 | from pydub import AudioSegment 10 | from faster_whisper import WhisperModel 11 | import hashlib 12 | import base64 13 | import librosa 14 | from whisper_timestamped.transcribe import get_audio_tensor, get_vad_segments 15 | 16 | model_size = "medium" 17 | # Run on GPU with FP16 18 | model = None 19 | def split_audio_whisper(audio_path, audio_name, target_dir='processed'): 20 | global model 21 | if model is None: 22 | model = WhisperModel(model_size, device="cuda", compute_type="float16") 23 | audio = AudioSegment.from_file(audio_path) 24 | max_len = len(audio) 25 | 26 | target_folder = os.path.join(target_dir, audio_name) 27 | 28 | segments, info = model.transcribe(audio_path, beam_size=5, word_timestamps=True) 29 | segments = list(segments) 30 | 31 | # create directory 32 | os.makedirs(target_folder, exist_ok=True) 33 | wavs_folder = os.path.join(target_folder, 'wavs') 34 | os.makedirs(wavs_folder, exist_ok=True) 35 | 36 | # segments 
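# Walk through the Whisper segments below: keep only segments between 1.5 s and 20 s whose
# cleaned text is 2-199 characters long, pad the end of each clip by ~0.08 s, and export the
# survivors as individual wav files that the tone-color extractor consumes later.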
37 | s_ind = 0 38 | start_time = None 39 | 40 | for k, w in enumerate(segments): 41 | # process with the time 42 | if k == 0: 43 | start_time = max(0, w.start) 44 | 45 | end_time = w.end 46 | 47 | # calculate confidence 48 | if len(w.words) > 0: 49 | confidence = sum([s.probability for s in w.words]) / len(w.words) 50 | else: 51 | confidence = 0. 52 | # clean text 53 | text = w.text.replace('...', '') 54 | 55 | # left 0.08s for each audios 56 | audio_seg = audio[int( start_time * 1000) : min(max_len, int(end_time * 1000) + 80)] 57 | 58 | # segment file name 59 | fname = f"{audio_name}_seg{s_ind}.wav" 60 | 61 | # filter out the segment shorter than 1.5s and longer than 20s 62 | save = audio_seg.duration_seconds > 1.5 and \ 63 | audio_seg.duration_seconds < 20. and \ 64 | len(text) >= 2 and len(text) < 200 65 | 66 | if save: 67 | output_file = os.path.join(wavs_folder, fname) 68 | audio_seg.export(output_file, format='wav') 69 | 70 | if k < len(segments) - 1: 71 | start_time = max(0, segments[k+1].start - 0.08) 72 | 73 | s_ind = s_ind + 1 74 | return wavs_folder 75 | 76 | 77 | def split_audio_vad(audio_path, audio_name, target_dir, split_seconds=10.0): 78 | SAMPLE_RATE = 16000 79 | audio_vad = get_audio_tensor(audio_path) 80 | segments = get_vad_segments( 81 | audio_vad, 82 | output_sample=True, 83 | min_speech_duration=0.1, 84 | min_silence_duration=1, 85 | method="silero", 86 | ) 87 | segments = [(seg["start"], seg["end"]) for seg in segments] 88 | segments = [(float(s) / SAMPLE_RATE, float(e) / SAMPLE_RATE) for s,e in segments] 89 | print(segments) 90 | audio_active = AudioSegment.silent(duration=0) 91 | audio = AudioSegment.from_file(audio_path) 92 | 93 | for start_time, end_time in segments: 94 | audio_active += audio[int( start_time * 1000) : int(end_time * 1000)] 95 | 96 | audio_dur = audio_active.duration_seconds 97 | print(f'after vad: dur = {audio_dur}') 98 | target_folder = os.path.join(target_dir, audio_name) 99 | wavs_folder = os.path.join(target_folder, 'wavs') 100 | os.makedirs(wavs_folder, exist_ok=True) 101 | start_time = 0. 
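# The voiced audio collected above is now cut into num_splits roughly equal chunks of about
# split_seconds seconds each; the final chunk is stretched to the end of the audio so no
# trailing samples are dropped.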
102 | count = 0 103 | num_splits = int(np.round(audio_dur / split_seconds)) 104 | assert num_splits > 0, 'input audio is too short' 105 | interval = audio_dur / num_splits 106 | 107 | for i in range(num_splits): 108 | end_time = min(start_time + interval, audio_dur) 109 | if i == num_splits - 1: 110 | end_time = audio_dur 111 | output_file = f"{wavs_folder}/{audio_name}_seg{count}.wav" 112 | audio_seg = audio_active[int(start_time * 1000): int(end_time * 1000)] 113 | audio_seg.export(output_file, format='wav') 114 | start_time = end_time 115 | count += 1 116 | return wavs_folder 117 | 118 | def hash_numpy_array(audio_path): 119 | array, _ = librosa.load(audio_path, sr=None, mono=True) 120 | # Convert the array to bytes 121 | array_bytes = array.tobytes() 122 | # Calculate the hash of the array bytes 123 | hash_object = hashlib.sha256(array_bytes) 124 | hash_value = hash_object.digest() 125 | # Convert the hash value to base64 126 | base64_value = base64.b64encode(hash_value) 127 | return base64_value.decode('utf-8')[:16].replace('/', '_^') 128 | 129 | def get_se(audio_path, vc_model, target_dir='processed', vad=True): 130 | device = vc_model.device 131 | version = vc_model.version 132 | print("OpenVoice version:", version) 133 | 134 | audio_name = f"{os.path.basename(audio_path).rsplit('.', 1)[0]}_{version}_{hash_numpy_array(audio_path)}" 135 | se_path = os.path.join(target_dir, audio_name, 'se.pth') 136 | 137 | # if os.path.isfile(se_path): 138 | # se = torch.load(se_path).to(device) 139 | # return se, audio_name 140 | # if os.path.isdir(audio_path): 141 | # wavs_folder = audio_path 142 | 143 | if vad: 144 | wavs_folder = split_audio_vad(audio_path, target_dir=target_dir, audio_name=audio_name) 145 | else: 146 | wavs_folder = split_audio_whisper(audio_path, target_dir=target_dir, audio_name=audio_name) 147 | 148 | audio_segs = glob(f'{wavs_folder}/*.wav') 149 | if len(audio_segs) == 0: 150 | raise NotImplementedError('No audio segments found!') 151 | 152 | return vc_model.extract_se(audio_segs, se_save_path=se_path), audio_name 153 | 154 | -------------------------------------------------------------------------------- /openvoice/text/__init__.py: -------------------------------------------------------------------------------- 1 | """ from https://github.com/keithito/tacotron """ 2 | from openvoice.text import cleaners 3 | from openvoice.text.symbols import symbols 4 | 5 | 6 | # Mappings from symbol to numeric ID and vice versa: 7 | _symbol_to_id = {s: i for i, s in enumerate(symbols)} 8 | _id_to_symbol = {i: s for i, s in enumerate(symbols)} 9 | 10 | 11 | def text_to_sequence(text, symbols, cleaner_names): 12 | '''Converts a string of text to a sequence of IDs corresponding to the symbols in the text. 13 | Args: 14 | text: string to convert to a sequence 15 | cleaner_names: names of the cleaner functions to run the text through 16 | Returns: 17 | List of integers corresponding to the symbols in the text 18 | ''' 19 | sequence = [] 20 | symbol_to_id = {s: i for i, s in enumerate(symbols)} 21 | clean_text = _clean_text(text, cleaner_names) 22 | print(clean_text) 23 | print(f" length:{len(clean_text)}") 24 | for symbol in clean_text: 25 | if symbol not in symbol_to_id.keys(): 26 | continue 27 | symbol_id = symbol_to_id[symbol] 28 | sequence += [symbol_id] 29 | print(f" length:{len(sequence)}") 30 | return sequence 31 | 32 | 33 | def cleaned_text_to_sequence(cleaned_text, symbols): 34 | '''Converts a string of text to a sequence of IDs corresponding to the symbols in the text. 
35 | Args: 36 | text: string to convert to a sequence 37 | Returns: 38 | List of integers corresponding to the symbols in the text 39 | ''' 40 | symbol_to_id = {s: i for i, s in enumerate(symbols)} 41 | sequence = [symbol_to_id[symbol] for symbol in cleaned_text if symbol in symbol_to_id.keys()] 42 | return sequence 43 | 44 | 45 | 46 | from openvoice.text.symbols import language_tone_start_map 47 | def cleaned_text_to_sequence_vits2(cleaned_text, tones, language, symbols, languages): 48 | """Converts a string of text to a sequence of IDs corresponding to the symbols in the text. 49 | Args: 50 | text: string to convert to a sequence 51 | Returns: 52 | List of integers corresponding to the symbols in the text 53 | """ 54 | symbol_to_id = {s: i for i, s in enumerate(symbols)} 55 | language_id_map = {s: i for i, s in enumerate(languages)} 56 | phones = [symbol_to_id[symbol] for symbol in cleaned_text] 57 | tone_start = language_tone_start_map[language] 58 | tones = [i + tone_start for i in tones] 59 | lang_id = language_id_map[language] 60 | lang_ids = [lang_id for i in phones] 61 | return phones, tones, lang_ids 62 | 63 | 64 | def sequence_to_text(sequence): 65 | '''Converts a sequence of IDs back to a string''' 66 | result = '' 67 | for symbol_id in sequence: 68 | s = _id_to_symbol[symbol_id] 69 | result += s 70 | return result 71 | 72 | 73 | def _clean_text(text, cleaner_names): 74 | for name in cleaner_names: 75 | cleaner = getattr(cleaners, name) 76 | if not cleaner: 77 | raise Exception('Unknown cleaner: %s' % name) 78 | text = cleaner(text) 79 | return text 80 | -------------------------------------------------------------------------------- /openvoice/text/cleaners.py: -------------------------------------------------------------------------------- 1 | import re 2 | from openvoice.text.english import english_to_lazy_ipa, english_to_ipa2, english_to_lazy_ipa2 3 | from openvoice.text.mandarin import number_to_chinese, chinese_to_bopomofo, latin_to_bopomofo, chinese_to_romaji, chinese_to_lazy_ipa, chinese_to_ipa, chinese_to_ipa2 4 | 5 | def cjke_cleaners2(text): 6 | text = re.sub(r'\[ZH\](.*?)\[ZH\]', 7 | lambda x: chinese_to_ipa(x.group(1))+' ', text) 8 | text = re.sub(r'\[JA\](.*?)\[JA\]', 9 | lambda x: japanese_to_ipa2(x.group(1))+' ', text) 10 | text = re.sub(r'\[KO\](.*?)\[KO\]', 11 | lambda x: korean_to_ipa(x.group(1))+' ', text) 12 | text = re.sub(r'\[EN\](.*?)\[EN\]', 13 | lambda x: english_to_ipa2(x.group(1))+' ', text) 14 | text = re.sub(r'\s+$', '', text) 15 | text = re.sub(r'([^\.,!\?\-…~])$', r'\1.', text) 16 | return text -------------------------------------------------------------------------------- /openvoice/text/english.py: -------------------------------------------------------------------------------- 1 | """ from https://github.com/keithito/tacotron """ 2 | 3 | ''' 4 | Cleaners are transformations that run over the input text at both training and eval time. 5 | 6 | Cleaners can be selected by passing a comma-delimited list of cleaner names as the "cleaners" 7 | hyperparameter. Some cleaners are English-specific. You'll typically want to use: 8 | 1. "english_cleaners" for English text 9 | 2. "transliteration_cleaners" for non-English text that can be transliterated to ASCII using 10 | the Unidecode library (https://pypi.python.org/pypi/Unidecode) 11 | 3. "basic_cleaners" if you do not want to transliterate (in this case, you should also update 12 | the symbols in symbols.py to match your data). 
13 | ''' 14 | 15 | 16 | # Regular expression matching whitespace: 17 | 18 | 19 | import re 20 | import inflect 21 | from unidecode import unidecode 22 | import eng_to_ipa as ipa 23 | _inflect = inflect.engine() 24 | _comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])') 25 | _decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)') 26 | _pounds_re = re.compile(r'£([0-9\,]*[0-9]+)') 27 | _dollars_re = re.compile(r'\$([0-9\.\,]*[0-9]+)') 28 | _ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)') 29 | _number_re = re.compile(r'[0-9]+') 30 | 31 | # List of (regular expression, replacement) pairs for abbreviations: 32 | _abbreviations = [(re.compile('\\b%s\\.' % x[0], re.IGNORECASE), x[1]) for x in [ 33 | ('mrs', 'misess'), 34 | ('mr', 'mister'), 35 | ('dr', 'doctor'), 36 | ('st', 'saint'), 37 | ('co', 'company'), 38 | ('jr', 'junior'), 39 | ('maj', 'major'), 40 | ('gen', 'general'), 41 | ('drs', 'doctors'), 42 | ('rev', 'reverend'), 43 | ('lt', 'lieutenant'), 44 | ('hon', 'honorable'), 45 | ('sgt', 'sergeant'), 46 | ('capt', 'captain'), 47 | ('esq', 'esquire'), 48 | ('ltd', 'limited'), 49 | ('col', 'colonel'), 50 | ('ft', 'fort'), 51 | ]] 52 | 53 | 54 | # List of (ipa, lazy ipa) pairs: 55 | _lazy_ipa = [(re.compile('%s' % x[0]), x[1]) for x in [ 56 | ('r', 'ɹ'), 57 | ('æ', 'e'), 58 | ('ɑ', 'a'), 59 | ('ɔ', 'o'), 60 | ('ð', 'z'), 61 | ('θ', 's'), 62 | ('ɛ', 'e'), 63 | ('ɪ', 'i'), 64 | ('ʊ', 'u'), 65 | ('ʒ', 'ʥ'), 66 | ('ʤ', 'ʥ'), 67 | ('ˈ', '↓'), 68 | ]] 69 | 70 | # List of (ipa, lazy ipa2) pairs: 71 | _lazy_ipa2 = [(re.compile('%s' % x[0]), x[1]) for x in [ 72 | ('r', 'ɹ'), 73 | ('ð', 'z'), 74 | ('θ', 's'), 75 | ('ʒ', 'ʑ'), 76 | ('ʤ', 'dʑ'), 77 | ('ˈ', '↓'), 78 | ]] 79 | 80 | # List of (ipa, ipa2) pairs 81 | _ipa_to_ipa2 = [(re.compile('%s' % x[0]), x[1]) for x in [ 82 | ('r', 'ɹ'), 83 | ('ʤ', 'dʒ'), 84 | ('ʧ', 'tʃ') 85 | ]] 86 | 87 | 88 | def expand_abbreviations(text): 89 | for regex, replacement in _abbreviations: 90 | text = re.sub(regex, replacement, text) 91 | return text 92 | 93 | 94 | def collapse_whitespace(text): 95 | return re.sub(r'\s+', ' ', text) 96 | 97 | 98 | def _remove_commas(m): 99 | return m.group(1).replace(',', '') 100 | 101 | 102 | def _expand_decimal_point(m): 103 | return m.group(1).replace('.', ' point ') 104 | 105 | 106 | def _expand_dollars(m): 107 | match = m.group(1) 108 | parts = match.split('.') 109 | if len(parts) > 2: 110 | return match + ' dollars' # Unexpected format 111 | dollars = int(parts[0]) if parts[0] else 0 112 | cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0 113 | if dollars and cents: 114 | dollar_unit = 'dollar' if dollars == 1 else 'dollars' 115 | cent_unit = 'cent' if cents == 1 else 'cents' 116 | return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit) 117 | elif dollars: 118 | dollar_unit = 'dollar' if dollars == 1 else 'dollars' 119 | return '%s %s' % (dollars, dollar_unit) 120 | elif cents: 121 | cent_unit = 'cent' if cents == 1 else 'cents' 122 | return '%s %s' % (cents, cent_unit) 123 | else: 124 | return 'zero dollars' 125 | 126 | 127 | def _expand_ordinal(m): 128 | return _inflect.number_to_words(m.group(0)) 129 | 130 | 131 | def _expand_number(m): 132 | num = int(m.group(0)) 133 | if num > 1000 and num < 3000: 134 | if num == 2000: 135 | return 'two thousand' 136 | elif num > 2000 and num < 2010: 137 | return 'two thousand ' + _inflect.number_to_words(num % 100) 138 | elif num % 100 == 0: 139 | return _inflect.number_to_words(num // 100) + ' hundred' 140 | else: 141 | return _inflect.number_to_words(num, andword='', 
zero='oh', group=2).replace(', ', ' ') 142 | else: 143 | return _inflect.number_to_words(num, andword='') 144 | 145 | 146 | def normalize_numbers(text): 147 | text = re.sub(_comma_number_re, _remove_commas, text) 148 | text = re.sub(_pounds_re, r'\1 pounds', text) 149 | text = re.sub(_dollars_re, _expand_dollars, text) 150 | text = re.sub(_decimal_number_re, _expand_decimal_point, text) 151 | text = re.sub(_ordinal_re, _expand_ordinal, text) 152 | text = re.sub(_number_re, _expand_number, text) 153 | return text 154 | 155 | 156 | def mark_dark_l(text): 157 | return re.sub(r'l([^aeiouæɑɔəɛɪʊ ]*(?: |$))', lambda x: 'ɫ'+x.group(1), text) 158 | 159 | 160 | def english_to_ipa(text): 161 | text = unidecode(text).lower() 162 | text = expand_abbreviations(text) 163 | text = normalize_numbers(text) 164 | phonemes = ipa.convert(text) 165 | phonemes = collapse_whitespace(phonemes) 166 | return phonemes 167 | 168 | 169 | def english_to_lazy_ipa(text): 170 | text = english_to_ipa(text) 171 | for regex, replacement in _lazy_ipa: 172 | text = re.sub(regex, replacement, text) 173 | return text 174 | 175 | 176 | def english_to_ipa2(text): 177 | text = english_to_ipa(text) 178 | text = mark_dark_l(text) 179 | for regex, replacement in _ipa_to_ipa2: 180 | text = re.sub(regex, replacement, text) 181 | return text.replace('...', '…') 182 | 183 | 184 | def english_to_lazy_ipa2(text): 185 | text = english_to_ipa(text) 186 | for regex, replacement in _lazy_ipa2: 187 | text = re.sub(regex, replacement, text) 188 | return text 189 | -------------------------------------------------------------------------------- /openvoice/text/mandarin.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import re 4 | from pypinyin import lazy_pinyin, BOPOMOFO 5 | import jieba 6 | import cn2an 7 | import logging 8 | 9 | 10 | # List of (Latin alphabet, bopomofo) pairs: 11 | _latin_to_bopomofo = [(re.compile('%s' % x[0], re.IGNORECASE), x[1]) for x in [ 12 | ('a', 'ㄟˉ'), 13 | ('b', 'ㄅㄧˋ'), 14 | ('c', 'ㄙㄧˉ'), 15 | ('d', 'ㄉㄧˋ'), 16 | ('e', 'ㄧˋ'), 17 | ('f', 'ㄝˊㄈㄨˋ'), 18 | ('g', 'ㄐㄧˋ'), 19 | ('h', 'ㄝˇㄑㄩˋ'), 20 | ('i', 'ㄞˋ'), 21 | ('j', 'ㄐㄟˋ'), 22 | ('k', 'ㄎㄟˋ'), 23 | ('l', 'ㄝˊㄛˋ'), 24 | ('m', 'ㄝˊㄇㄨˋ'), 25 | ('n', 'ㄣˉ'), 26 | ('o', 'ㄡˉ'), 27 | ('p', 'ㄆㄧˉ'), 28 | ('q', 'ㄎㄧㄡˉ'), 29 | ('r', 'ㄚˋ'), 30 | ('s', 'ㄝˊㄙˋ'), 31 | ('t', 'ㄊㄧˋ'), 32 | ('u', 'ㄧㄡˉ'), 33 | ('v', 'ㄨㄧˉ'), 34 | ('w', 'ㄉㄚˋㄅㄨˋㄌㄧㄡˋ'), 35 | ('x', 'ㄝˉㄎㄨˋㄙˋ'), 36 | ('y', 'ㄨㄞˋ'), 37 | ('z', 'ㄗㄟˋ') 38 | ]] 39 | 40 | # List of (bopomofo, romaji) pairs: 41 | _bopomofo_to_romaji = [(re.compile('%s' % x[0]), x[1]) for x in [ 42 | ('ㄅㄛ', 'p⁼wo'), 43 | ('ㄆㄛ', 'pʰwo'), 44 | ('ㄇㄛ', 'mwo'), 45 | ('ㄈㄛ', 'fwo'), 46 | ('ㄅ', 'p⁼'), 47 | ('ㄆ', 'pʰ'), 48 | ('ㄇ', 'm'), 49 | ('ㄈ', 'f'), 50 | ('ㄉ', 't⁼'), 51 | ('ㄊ', 'tʰ'), 52 | ('ㄋ', 'n'), 53 | ('ㄌ', 'l'), 54 | ('ㄍ', 'k⁼'), 55 | ('ㄎ', 'kʰ'), 56 | ('ㄏ', 'h'), 57 | ('ㄐ', 'ʧ⁼'), 58 | ('ㄑ', 'ʧʰ'), 59 | ('ㄒ', 'ʃ'), 60 | ('ㄓ', 'ʦ`⁼'), 61 | ('ㄔ', 'ʦ`ʰ'), 62 | ('ㄕ', 's`'), 63 | ('ㄖ', 'ɹ`'), 64 | ('ㄗ', 'ʦ⁼'), 65 | ('ㄘ', 'ʦʰ'), 66 | ('ㄙ', 's'), 67 | ('ㄚ', 'a'), 68 | ('ㄛ', 'o'), 69 | ('ㄜ', 'ə'), 70 | ('ㄝ', 'e'), 71 | ('ㄞ', 'ai'), 72 | ('ㄟ', 'ei'), 73 | ('ㄠ', 'au'), 74 | ('ㄡ', 'ou'), 75 | ('ㄧㄢ', 'yeNN'), 76 | ('ㄢ', 'aNN'), 77 | ('ㄧㄣ', 'iNN'), 78 | ('ㄣ', 'əNN'), 79 | ('ㄤ', 'aNg'), 80 | ('ㄧㄥ', 'iNg'), 81 | ('ㄨㄥ', 'uNg'), 82 | ('ㄩㄥ', 'yuNg'), 83 | ('ㄥ', 'əNg'), 84 | ('ㄦ', 'əɻ'), 85 | ('ㄧ', 'i'), 86 | ('ㄨ', 'u'), 87 | ('ㄩ', 'ɥ'), 88 | ('ˉ', '→'), 89 | ('ˊ', '↑'), 90 | ('ˇ', '↓↑'), 91 | ('ˋ', '↓'), 92 
| ('˙', ''), 93 | (',', ','), 94 | ('。', '.'), 95 | ('!', '!'), 96 | ('?', '?'), 97 | ('—', '-') 98 | ]] 99 | 100 | # List of (romaji, ipa) pairs: 101 | _romaji_to_ipa = [(re.compile('%s' % x[0], re.IGNORECASE), x[1]) for x in [ 102 | ('ʃy', 'ʃ'), 103 | ('ʧʰy', 'ʧʰ'), 104 | ('ʧ⁼y', 'ʧ⁼'), 105 | ('NN', 'n'), 106 | ('Ng', 'ŋ'), 107 | ('y', 'j'), 108 | ('h', 'x') 109 | ]] 110 | 111 | # List of (bopomofo, ipa) pairs: 112 | _bopomofo_to_ipa = [(re.compile('%s' % x[0]), x[1]) for x in [ 113 | ('ㄅㄛ', 'p⁼wo'), 114 | ('ㄆㄛ', 'pʰwo'), 115 | ('ㄇㄛ', 'mwo'), 116 | ('ㄈㄛ', 'fwo'), 117 | ('ㄅ', 'p⁼'), 118 | ('ㄆ', 'pʰ'), 119 | ('ㄇ', 'm'), 120 | ('ㄈ', 'f'), 121 | ('ㄉ', 't⁼'), 122 | ('ㄊ', 'tʰ'), 123 | ('ㄋ', 'n'), 124 | ('ㄌ', 'l'), 125 | ('ㄍ', 'k⁼'), 126 | ('ㄎ', 'kʰ'), 127 | ('ㄏ', 'x'), 128 | ('ㄐ', 'tʃ⁼'), 129 | ('ㄑ', 'tʃʰ'), 130 | ('ㄒ', 'ʃ'), 131 | ('ㄓ', 'ts`⁼'), 132 | ('ㄔ', 'ts`ʰ'), 133 | ('ㄕ', 's`'), 134 | ('ㄖ', 'ɹ`'), 135 | ('ㄗ', 'ts⁼'), 136 | ('ㄘ', 'tsʰ'), 137 | ('ㄙ', 's'), 138 | ('ㄚ', 'a'), 139 | ('ㄛ', 'o'), 140 | ('ㄜ', 'ə'), 141 | ('ㄝ', 'ɛ'), 142 | ('ㄞ', 'aɪ'), 143 | ('ㄟ', 'eɪ'), 144 | ('ㄠ', 'ɑʊ'), 145 | ('ㄡ', 'oʊ'), 146 | ('ㄧㄢ', 'jɛn'), 147 | ('ㄩㄢ', 'ɥæn'), 148 | ('ㄢ', 'an'), 149 | ('ㄧㄣ', 'in'), 150 | ('ㄩㄣ', 'ɥn'), 151 | ('ㄣ', 'ən'), 152 | ('ㄤ', 'ɑŋ'), 153 | ('ㄧㄥ', 'iŋ'), 154 | ('ㄨㄥ', 'ʊŋ'), 155 | ('ㄩㄥ', 'jʊŋ'), 156 | ('ㄥ', 'əŋ'), 157 | ('ㄦ', 'əɻ'), 158 | ('ㄧ', 'i'), 159 | ('ㄨ', 'u'), 160 | ('ㄩ', 'ɥ'), 161 | ('ˉ', '→'), 162 | ('ˊ', '↑'), 163 | ('ˇ', '↓↑'), 164 | ('ˋ', '↓'), 165 | ('˙', ''), 166 | (',', ','), 167 | ('。', '.'), 168 | ('!', '!'), 169 | ('?', '?'), 170 | ('—', '-') 171 | ]] 172 | 173 | # List of (bopomofo, ipa2) pairs: 174 | _bopomofo_to_ipa2 = [(re.compile('%s' % x[0]), x[1]) for x in [ 175 | ('ㄅㄛ', 'pwo'), 176 | ('ㄆㄛ', 'pʰwo'), 177 | ('ㄇㄛ', 'mwo'), 178 | ('ㄈㄛ', 'fwo'), 179 | ('ㄅ', 'p'), 180 | ('ㄆ', 'pʰ'), 181 | ('ㄇ', 'm'), 182 | ('ㄈ', 'f'), 183 | ('ㄉ', 't'), 184 | ('ㄊ', 'tʰ'), 185 | ('ㄋ', 'n'), 186 | ('ㄌ', 'l'), 187 | ('ㄍ', 'k'), 188 | ('ㄎ', 'kʰ'), 189 | ('ㄏ', 'h'), 190 | ('ㄐ', 'tɕ'), 191 | ('ㄑ', 'tɕʰ'), 192 | ('ㄒ', 'ɕ'), 193 | ('ㄓ', 'tʂ'), 194 | ('ㄔ', 'tʂʰ'), 195 | ('ㄕ', 'ʂ'), 196 | ('ㄖ', 'ɻ'), 197 | ('ㄗ', 'ts'), 198 | ('ㄘ', 'tsʰ'), 199 | ('ㄙ', 's'), 200 | ('ㄚ', 'a'), 201 | ('ㄛ', 'o'), 202 | ('ㄜ', 'ɤ'), 203 | ('ㄝ', 'ɛ'), 204 | ('ㄞ', 'aɪ'), 205 | ('ㄟ', 'eɪ'), 206 | ('ㄠ', 'ɑʊ'), 207 | ('ㄡ', 'oʊ'), 208 | ('ㄧㄢ', 'jɛn'), 209 | ('ㄩㄢ', 'yæn'), 210 | ('ㄢ', 'an'), 211 | ('ㄧㄣ', 'in'), 212 | ('ㄩㄣ', 'yn'), 213 | ('ㄣ', 'ən'), 214 | ('ㄤ', 'ɑŋ'), 215 | ('ㄧㄥ', 'iŋ'), 216 | ('ㄨㄥ', 'ʊŋ'), 217 | ('ㄩㄥ', 'jʊŋ'), 218 | ('ㄥ', 'ɤŋ'), 219 | ('ㄦ', 'əɻ'), 220 | ('ㄧ', 'i'), 221 | ('ㄨ', 'u'), 222 | ('ㄩ', 'y'), 223 | ('ˉ', '˥'), 224 | ('ˊ', '˧˥'), 225 | ('ˇ', '˨˩˦'), 226 | ('ˋ', '˥˩'), 227 | ('˙', ''), 228 | (',', ','), 229 | ('。', '.'), 230 | ('!', '!'), 231 | ('?', '?'), 232 | ('—', '-') 233 | ]] 234 | 235 | 236 | def number_to_chinese(text): 237 | numbers = re.findall(r'\d+(?:\.?\d+)?', text) 238 | for number in numbers: 239 | text = text.replace(number, cn2an.an2cn(number), 1) 240 | return text 241 | 242 | 243 | def chinese_to_bopomofo(text): 244 | text = text.replace('、', ',').replace(';', ',').replace(':', ',') 245 | words = jieba.lcut(text, cut_all=False) 246 | text = '' 247 | for word in words: 248 | bopomofos = lazy_pinyin(word, BOPOMOFO) 249 | if not re.search('[\u4e00-\u9fff]', word): 250 | text += word 251 | continue 252 | for i in range(len(bopomofos)): 253 | bopomofos[i] = re.sub(r'([\u3105-\u3129])$', r'\1ˉ', bopomofos[i]) 254 | if text != '': 255 | text += ' ' 256 | text += ''.join(bopomofos) 257 | 
return text 258 | 259 | 260 | def latin_to_bopomofo(text): 261 | for regex, replacement in _latin_to_bopomofo: 262 | text = re.sub(regex, replacement, text) 263 | return text 264 | 265 | 266 | def bopomofo_to_romaji(text): 267 | for regex, replacement in _bopomofo_to_romaji: 268 | text = re.sub(regex, replacement, text) 269 | return text 270 | 271 | 272 | def bopomofo_to_ipa(text): 273 | for regex, replacement in _bopomofo_to_ipa: 274 | text = re.sub(regex, replacement, text) 275 | return text 276 | 277 | 278 | def bopomofo_to_ipa2(text): 279 | for regex, replacement in _bopomofo_to_ipa2: 280 | text = re.sub(regex, replacement, text) 281 | return text 282 | 283 | 284 | def chinese_to_romaji(text): 285 | text = number_to_chinese(text) 286 | text = chinese_to_bopomofo(text) 287 | text = latin_to_bopomofo(text) 288 | text = bopomofo_to_romaji(text) 289 | text = re.sub('i([aoe])', r'y\1', text) 290 | text = re.sub('u([aoəe])', r'w\1', text) 291 | text = re.sub('([ʦsɹ]`[⁼ʰ]?)([→↓↑ ]+|$)', 292 | r'\1ɹ`\2', text).replace('ɻ', 'ɹ`') 293 | text = re.sub('([ʦs][⁼ʰ]?)([→↓↑ ]+|$)', r'\1ɹ\2', text) 294 | return text 295 | 296 | 297 | def chinese_to_lazy_ipa(text): 298 | text = chinese_to_romaji(text) 299 | for regex, replacement in _romaji_to_ipa: 300 | text = re.sub(regex, replacement, text) 301 | return text 302 | 303 | 304 | def chinese_to_ipa(text): 305 | text = number_to_chinese(text) 306 | text = chinese_to_bopomofo(text) 307 | text = latin_to_bopomofo(text) 308 | text = bopomofo_to_ipa(text) 309 | text = re.sub('i([aoe])', r'j\1', text) 310 | text = re.sub('u([aoəe])', r'w\1', text) 311 | text = re.sub('([sɹ]`[⁼ʰ]?)([→↓↑ ]+|$)', 312 | r'\1ɹ`\2', text).replace('ɻ', 'ɹ`') 313 | text = re.sub('([s][⁼ʰ]?)([→↓↑ ]+|$)', r'\1ɹ\2', text) 314 | return text 315 | 316 | 317 | def chinese_to_ipa2(text): 318 | text = number_to_chinese(text) 319 | text = chinese_to_bopomofo(text) 320 | text = latin_to_bopomofo(text) 321 | text = bopomofo_to_ipa2(text) 322 | text = re.sub(r'i([aoe])', r'j\1', text) 323 | text = re.sub(r'u([aoəe])', r'w\1', text) 324 | text = re.sub(r'([ʂɹ]ʰ?)([˩˨˧˦˥ ]+|$)', r'\1ʅ\2', text) 325 | text = re.sub(r'(sʰ?)([˩˨˧˦˥ ]+|$)', r'\1ɿ\2', text) 326 | return text 327 | -------------------------------------------------------------------------------- /openvoice/text/symbols.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Defines the set of symbols used in text input to the model. 
3 | ''' 4 | 5 | # japanese_cleaners 6 | # _pad = '_' 7 | # _punctuation = ',.!?-' 8 | # _letters = 'AEINOQUabdefghijkmnoprstuvwyzʃʧ↓↑ ' 9 | 10 | 11 | '''# japanese_cleaners2 12 | _pad = '_' 13 | _punctuation = ',.!?-~…' 14 | _letters = 'AEINOQUabdefghijkmnoprstuvwyzʃʧʦ↓↑ ' 15 | ''' 16 | 17 | 18 | '''# korean_cleaners 19 | _pad = '_' 20 | _punctuation = ',.!?…~' 21 | _letters = 'ㄱㄴㄷㄹㅁㅂㅅㅇㅈㅊㅋㅌㅍㅎㄲㄸㅃㅆㅉㅏㅓㅗㅜㅡㅣㅐㅔ ' 22 | ''' 23 | 24 | '''# chinese_cleaners 25 | _pad = '_' 26 | _punctuation = ',。!?—…' 27 | _letters = 'ㄅㄆㄇㄈㄉㄊㄋㄌㄍㄎㄏㄐㄑㄒㄓㄔㄕㄖㄗㄘㄙㄚㄛㄜㄝㄞㄟㄠㄡㄢㄣㄤㄥㄦㄧㄨㄩˉˊˇˋ˙ ' 28 | ''' 29 | 30 | # # zh_ja_mixture_cleaners 31 | # _pad = '_' 32 | # _punctuation = ',.!?-~…' 33 | # _letters = 'AEINOQUabdefghijklmnoprstuvwyzʃʧʦɯɹəɥ⁼ʰ`→↓↑ ' 34 | 35 | 36 | '''# sanskrit_cleaners 37 | _pad = '_' 38 | _punctuation = '।' 39 | _letters = 'ँंःअआइईउऊऋएऐओऔकखगघङचछजझञटठडढणतथदधनपफबभमयरलळवशषसहऽािीुूृॄेैोौ्ॠॢ ' 40 | ''' 41 | 42 | '''# cjks_cleaners 43 | _pad = '_' 44 | _punctuation = ',.!?-~…' 45 | _letters = 'NQabdefghijklmnopstuvwxyzʃʧʥʦɯɹəɥçɸɾβŋɦː⁼ʰ`^#*=→↓↑ ' 46 | ''' 47 | 48 | '''# thai_cleaners 49 | _pad = '_' 50 | _punctuation = '.!? ' 51 | _letters = 'กขฃคฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรฤลวศษสหฬอฮฯะัาำิีึืุูเแโใไๅๆ็่้๊๋์' 52 | ''' 53 | 54 | # # cjke_cleaners2 55 | _pad = '_' 56 | _punctuation = ',.!?-~…' 57 | _letters = 'NQabdefghijklmnopstuvwxyzɑæʃʑçɯɪɔɛɹðəɫɥɸʊɾʒθβŋɦ⁼ʰ`^#*=ˈˌ→↓↑ ' 58 | 59 | 60 | '''# shanghainese_cleaners 61 | _pad = '_' 62 | _punctuation = ',.!?…' 63 | _letters = 'abdfghiklmnopstuvyzøŋȵɑɔɕəɤɦɪɿʑʔʰ̩̃ᴀᴇ15678 ' 64 | ''' 65 | 66 | '''# chinese_dialect_cleaners 67 | _pad = '_' 68 | _punctuation = ',.!?~…─' 69 | _letters = '#Nabdefghijklmnoprstuvwxyzæçøŋœȵɐɑɒɓɔɕɗɘəɚɛɜɣɤɦɪɭɯɵɷɸɻɾɿʂʅʊʋʌʏʑʔʦʮʰʷˀː˥˦˧˨˩̥̩̃̚ᴀᴇ↑↓∅ⱼ ' 70 | ''' 71 | 72 | # Export all symbols: 73 | symbols = [_pad] + list(_punctuation) + list(_letters) 74 | 75 | # Special symbol ids 76 | SPACE_ID = symbols.index(" ") 77 | 78 | num_ja_tones = 1 79 | num_kr_tones = 1 80 | num_zh_tones = 6 81 | num_en_tones = 4 82 | 83 | language_tone_start_map = { 84 | "ZH": 0, 85 | "JP": num_zh_tones, 86 | "EN": num_zh_tones + num_ja_tones, 87 | 'KR': num_zh_tones + num_ja_tones + num_en_tones, 88 | } -------------------------------------------------------------------------------- /openvoice/transforms.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.nn import functional as F 3 | 4 | import numpy as np 5 | 6 | 7 | DEFAULT_MIN_BIN_WIDTH = 1e-3 8 | DEFAULT_MIN_BIN_HEIGHT = 1e-3 9 | DEFAULT_MIN_DERIVATIVE = 1e-3 10 | 11 | 12 | def piecewise_rational_quadratic_transform( 13 | inputs, 14 | unnormalized_widths, 15 | unnormalized_heights, 16 | unnormalized_derivatives, 17 | inverse=False, 18 | tails=None, 19 | tail_bound=1.0, 20 | min_bin_width=DEFAULT_MIN_BIN_WIDTH, 21 | min_bin_height=DEFAULT_MIN_BIN_HEIGHT, 22 | min_derivative=DEFAULT_MIN_DERIVATIVE, 23 | ): 24 | if tails is None: 25 | spline_fn = rational_quadratic_spline 26 | spline_kwargs = {} 27 | else: 28 | spline_fn = unconstrained_rational_quadratic_spline 29 | spline_kwargs = {"tails": tails, "tail_bound": tail_bound} 30 | 31 | outputs, logabsdet = spline_fn( 32 | inputs=inputs, 33 | unnormalized_widths=unnormalized_widths, 34 | unnormalized_heights=unnormalized_heights, 35 | unnormalized_derivatives=unnormalized_derivatives, 36 | inverse=inverse, 37 | min_bin_width=min_bin_width, 38 | min_bin_height=min_bin_height, 39 | min_derivative=min_derivative, 40 | **spline_kwargs 41 | ) 42 | return outputs, logabsdet 43 | 44 | 45 | def searchsorted(bin_locations, 
inputs, eps=1e-6): 46 | bin_locations[..., -1] += eps 47 | return torch.sum(inputs[..., None] >= bin_locations, dim=-1) - 1 48 | 49 | 50 | def unconstrained_rational_quadratic_spline( 51 | inputs, 52 | unnormalized_widths, 53 | unnormalized_heights, 54 | unnormalized_derivatives, 55 | inverse=False, 56 | tails="linear", 57 | tail_bound=1.0, 58 | min_bin_width=DEFAULT_MIN_BIN_WIDTH, 59 | min_bin_height=DEFAULT_MIN_BIN_HEIGHT, 60 | min_derivative=DEFAULT_MIN_DERIVATIVE, 61 | ): 62 | inside_interval_mask = (inputs >= -tail_bound) & (inputs <= tail_bound) 63 | outside_interval_mask = ~inside_interval_mask 64 | 65 | outputs = torch.zeros_like(inputs) 66 | logabsdet = torch.zeros_like(inputs) 67 | 68 | if tails == "linear": 69 | unnormalized_derivatives = F.pad(unnormalized_derivatives, pad=(1, 1)) 70 | constant = np.log(np.exp(1 - min_derivative) - 1) 71 | unnormalized_derivatives[..., 0] = constant 72 | unnormalized_derivatives[..., -1] = constant 73 | 74 | outputs[outside_interval_mask] = inputs[outside_interval_mask] 75 | logabsdet[outside_interval_mask] = 0 76 | else: 77 | raise RuntimeError("{} tails are not implemented.".format(tails)) 78 | 79 | ( 80 | outputs[inside_interval_mask], 81 | logabsdet[inside_interval_mask], 82 | ) = rational_quadratic_spline( 83 | inputs=inputs[inside_interval_mask], 84 | unnormalized_widths=unnormalized_widths[inside_interval_mask, :], 85 | unnormalized_heights=unnormalized_heights[inside_interval_mask, :], 86 | unnormalized_derivatives=unnormalized_derivatives[inside_interval_mask, :], 87 | inverse=inverse, 88 | left=-tail_bound, 89 | right=tail_bound, 90 | bottom=-tail_bound, 91 | top=tail_bound, 92 | min_bin_width=min_bin_width, 93 | min_bin_height=min_bin_height, 94 | min_derivative=min_derivative, 95 | ) 96 | 97 | return outputs, logabsdet 98 | 99 | 100 | def rational_quadratic_spline( 101 | inputs, 102 | unnormalized_widths, 103 | unnormalized_heights, 104 | unnormalized_derivatives, 105 | inverse=False, 106 | left=0.0, 107 | right=1.0, 108 | bottom=0.0, 109 | top=1.0, 110 | min_bin_width=DEFAULT_MIN_BIN_WIDTH, 111 | min_bin_height=DEFAULT_MIN_BIN_HEIGHT, 112 | min_derivative=DEFAULT_MIN_DERIVATIVE, 113 | ): 114 | if torch.min(inputs) < left or torch.max(inputs) > right: 115 | raise ValueError("Input to a transform is not within its domain") 116 | 117 | num_bins = unnormalized_widths.shape[-1] 118 | 119 | if min_bin_width * num_bins > 1.0: 120 | raise ValueError("Minimal bin width too large for the number of bins") 121 | if min_bin_height * num_bins > 1.0: 122 | raise ValueError("Minimal bin height too large for the number of bins") 123 | 124 | widths = F.softmax(unnormalized_widths, dim=-1) 125 | widths = min_bin_width + (1 - min_bin_width * num_bins) * widths 126 | cumwidths = torch.cumsum(widths, dim=-1) 127 | cumwidths = F.pad(cumwidths, pad=(1, 0), mode="constant", value=0.0) 128 | cumwidths = (right - left) * cumwidths + left 129 | cumwidths[..., 0] = left 130 | cumwidths[..., -1] = right 131 | widths = cumwidths[..., 1:] - cumwidths[..., :-1] 132 | 133 | derivatives = min_derivative + F.softplus(unnormalized_derivatives) 134 | 135 | heights = F.softmax(unnormalized_heights, dim=-1) 136 | heights = min_bin_height + (1 - min_bin_height * num_bins) * heights 137 | cumheights = torch.cumsum(heights, dim=-1) 138 | cumheights = F.pad(cumheights, pad=(1, 0), mode="constant", value=0.0) 139 | cumheights = (top - bottom) * cumheights + bottom 140 | cumheights[..., 0] = bottom 141 | cumheights[..., -1] = top 142 | heights = cumheights[..., 1:] - 
cumheights[..., :-1] 143 | 144 | if inverse: 145 | bin_idx = searchsorted(cumheights, inputs)[..., None] 146 | else: 147 | bin_idx = searchsorted(cumwidths, inputs)[..., None] 148 | 149 | input_cumwidths = cumwidths.gather(-1, bin_idx)[..., 0] 150 | input_bin_widths = widths.gather(-1, bin_idx)[..., 0] 151 | 152 | input_cumheights = cumheights.gather(-1, bin_idx)[..., 0] 153 | delta = heights / widths 154 | input_delta = delta.gather(-1, bin_idx)[..., 0] 155 | 156 | input_derivatives = derivatives.gather(-1, bin_idx)[..., 0] 157 | input_derivatives_plus_one = derivatives[..., 1:].gather(-1, bin_idx)[..., 0] 158 | 159 | input_heights = heights.gather(-1, bin_idx)[..., 0] 160 | 161 | if inverse: 162 | a = (inputs - input_cumheights) * ( 163 | input_derivatives + input_derivatives_plus_one - 2 * input_delta 164 | ) + input_heights * (input_delta - input_derivatives) 165 | b = input_heights * input_derivatives - (inputs - input_cumheights) * ( 166 | input_derivatives + input_derivatives_plus_one - 2 * input_delta 167 | ) 168 | c = -input_delta * (inputs - input_cumheights) 169 | 170 | discriminant = b.pow(2) - 4 * a * c 171 | assert (discriminant >= 0).all() 172 | 173 | root = (2 * c) / (-b - torch.sqrt(discriminant)) 174 | outputs = root * input_bin_widths + input_cumwidths 175 | 176 | theta_one_minus_theta = root * (1 - root) 177 | denominator = input_delta + ( 178 | (input_derivatives + input_derivatives_plus_one - 2 * input_delta) 179 | * theta_one_minus_theta 180 | ) 181 | derivative_numerator = input_delta.pow(2) * ( 182 | input_derivatives_plus_one * root.pow(2) 183 | + 2 * input_delta * theta_one_minus_theta 184 | + input_derivatives * (1 - root).pow(2) 185 | ) 186 | logabsdet = torch.log(derivative_numerator) - 2 * torch.log(denominator) 187 | 188 | return outputs, -logabsdet 189 | else: 190 | theta = (inputs - input_cumwidths) / input_bin_widths 191 | theta_one_minus_theta = theta * (1 - theta) 192 | 193 | numerator = input_heights * ( 194 | input_delta * theta.pow(2) + input_derivatives * theta_one_minus_theta 195 | ) 196 | denominator = input_delta + ( 197 | (input_derivatives + input_derivatives_plus_one - 2 * input_delta) 198 | * theta_one_minus_theta 199 | ) 200 | outputs = input_cumheights + numerator / denominator 201 | 202 | derivative_numerator = input_delta.pow(2) * ( 203 | input_derivatives_plus_one * theta.pow(2) 204 | + 2 * input_delta * theta_one_minus_theta 205 | + input_derivatives * (1 - theta).pow(2) 206 | ) 207 | logabsdet = torch.log(derivative_numerator) - 2 * torch.log(denominator) 208 | 209 | return outputs, logabsdet 210 | -------------------------------------------------------------------------------- /openvoice/utils.py: -------------------------------------------------------------------------------- 1 | import re 2 | import json 3 | import numpy as np 4 | 5 | 6 | def get_hparams_from_file(config_path): 7 | with open(config_path, "r", encoding="utf-8") as f: 8 | data = f.read() 9 | config = json.loads(data) 10 | 11 | hparams = HParams(**config) 12 | return hparams 13 | 14 | class HParams: 15 | def __init__(self, **kwargs): 16 | for k, v in kwargs.items(): 17 | if type(v) == dict: 18 | v = HParams(**v) 19 | self[k] = v 20 | 21 | def keys(self): 22 | return self.__dict__.keys() 23 | 24 | def items(self): 25 | return self.__dict__.items() 26 | 27 | def values(self): 28 | return self.__dict__.values() 29 | 30 | def __len__(self): 31 | return len(self.__dict__) 32 | 33 | def __getitem__(self, key): 34 | return getattr(self, key) 35 | 36 | def __setitem__(self, 
key, value): 37 | return setattr(self, key, value) 38 | 39 | def __contains__(self, key): 40 | return key in self.__dict__ 41 | 42 | def __repr__(self): 43 | return self.__dict__.__repr__() 44 | 45 | 46 | def string_to_bits(string, pad_len=8): 47 | # Convert each character to its ASCII value 48 | ascii_values = [ord(char) for char in string] 49 | 50 | # Convert ASCII values to binary representation 51 | binary_values = [bin(value)[2:].zfill(8) for value in ascii_values] 52 | 53 | # Convert binary strings to integer arrays 54 | bit_arrays = [[int(bit) for bit in binary] for binary in binary_values] 55 | 56 | # Convert list of arrays to NumPy array 57 | numpy_array = np.array(bit_arrays) 58 | numpy_array_full = np.zeros((pad_len, 8), dtype=numpy_array.dtype) 59 | numpy_array_full[:, 2] = 1 60 | max_len = min(pad_len, len(numpy_array)) 61 | numpy_array_full[:max_len] = numpy_array[:max_len] 62 | return numpy_array_full 63 | 64 | 65 | def bits_to_string(bits_array): 66 | # Convert each row of the array to a binary string 67 | binary_values = [''.join(str(bit) for bit in row) for row in bits_array] 68 | 69 | # Convert binary strings to ASCII values 70 | ascii_values = [int(binary, 2) for binary in binary_values] 71 | 72 | # Convert ASCII values to characters 73 | output_string = ''.join(chr(value) for value in ascii_values) 74 | 75 | return output_string 76 | 77 | 78 | def split_sentence(text, min_len=10, language_str='[EN]'): 79 | if language_str in ['EN']: 80 | sentences = split_sentences_latin(text, min_len=min_len) 81 | else: 82 | sentences = split_sentences_zh(text, min_len=min_len) 83 | return sentences 84 | 85 | def split_sentences_latin(text, min_len=10): 86 | """Split Long sentences into list of short ones 87 | 88 | Args: 89 | str: Input sentences. 90 | 91 | Returns: 92 | List[str]: list of output sentences. 93 | """ 94 | # deal with dirty sentences 95 | text = re.sub('[。!?;]', '.', text) 96 | text = re.sub('[,]', ',', text) 97 | text = re.sub('[“”]', '"', text) 98 | text = re.sub('[‘’]', "'", text) 99 | text = re.sub(r"[\<\>\(\)\[\]\"\«\»]+", "", text) 100 | text = re.sub('[\n\t ]+', ' ', text) 101 | text = re.sub('([,.!?;])', r'\1 $#!', text) 102 | # split 103 | sentences = [s.strip() for s in text.split('$#!')] 104 | if len(sentences[-1]) == 0: del sentences[-1] 105 | 106 | new_sentences = [] 107 | new_sent = [] 108 | count_len = 0 109 | for ind, sent in enumerate(sentences): 110 | # print(sent) 111 | new_sent.append(sent) 112 | count_len += len(sent.split(" ")) 113 | if count_len > min_len or ind == len(sentences) - 1: 114 | count_len = 0 115 | new_sentences.append(' '.join(new_sent)) 116 | new_sent = [] 117 | return merge_short_sentences_latin(new_sentences) 118 | 119 | 120 | def merge_short_sentences_latin(sens): 121 | """Avoid short sentences by merging them with the following sentence. 122 | 123 | Args: 124 | List[str]: list of input sentences. 125 | 126 | Returns: 127 | List[str]: list of output sentences. 128 | """ 129 | sens_out = [] 130 | for s in sens: 131 | # If the previous sentence is too short, merge them with 132 | # the current sentence. 
133 |         if len(sens_out) > 0 and len(sens_out[-1].split(" ")) <= 2:
134 |             sens_out[-1] = sens_out[-1] + " " + s
135 |         else:
136 |             sens_out.append(s)
137 |     try:
138 |         if len(sens_out[-1].split(" ")) <= 2:
139 |             sens_out[-2] = sens_out[-2] + " " + sens_out[-1]
140 |             sens_out.pop(-1)
141 |     except IndexError:
142 |         pass
143 |     return sens_out
144 | 
145 | def split_sentences_zh(text, min_len=10):
146 |     text = re.sub('[。!?;]', '.', text)
147 |     text = re.sub('[,]', ',', text)
148 |     # Replace newlines, tabs and repeated spaces with a single space
149 |     text = re.sub('[\n\t ]+', ' ', text)
150 |     # Insert a split marker after each punctuation mark
151 |     text = re.sub('([,.!?;])', r'\1 $#!', text)
152 |     # Split into sentences and strip leading/trailing whitespace
153 |     # sentences = [s.strip() for s in re.split('(。|!|?|;)', text)]
154 |     sentences = [s.strip() for s in text.split('$#!')]
155 |     if len(sentences[-1]) == 0: del sentences[-1]
156 | 
157 |     new_sentences = []
158 |     new_sent = []
159 |     count_len = 0
160 |     for ind, sent in enumerate(sentences):
161 |         new_sent.append(sent)
162 |         count_len += len(sent)
163 |         if count_len > min_len or ind == len(sentences) - 1:
164 |             count_len = 0
165 |             new_sentences.append(' '.join(new_sent))
166 |             new_sent = []
167 |     return merge_short_sentences_zh(new_sentences)
168 | 
169 | 
170 | def merge_short_sentences_zh(sens):
171 |     # return sens
172 |     """Avoid short sentences by merging them with the following sentence.
173 | 
174 |     Args:
175 |         sens (List[str]): list of input sentences.
176 | 
177 |     Returns:
178 |         List[str]: list of output sentences.
179 |     """
180 |     sens_out = []
181 |     for s in sens:
182 |         # If the previous sentence is too short, merge it with
183 |         # the current sentence.
184 |         if len(sens_out) > 0 and len(sens_out[-1]) <= 2:
185 |             sens_out[-1] = sens_out[-1] + " " + s
186 |         else:
187 |             sens_out.append(s)
188 |     try:
189 |         if len(sens_out[-1]) <= 2:
190 |             sens_out[-2] = sens_out[-2] + " " + sens_out[-1]
191 |             sens_out.pop(-1)
192 |     except IndexError:
193 |         pass
194 |     return sens_out
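A quick round trip makes the helpers above easier to follow. The sketch below is illustrative only: the sample text is invented, and `language_str='EN'` is passed explicitly because the Latin splitter is selected only when the argument is exactly `'EN'`.

```python
from openvoice.utils import split_sentence, string_to_bits, bits_to_string

# Chunk a long English prompt into pieces of at least ~10 words each.
text = ("This is a long example prompt for the sentence splitter. "
        "It contains several sentences of varying length. "
        "Short fragments get merged with their neighbours.")
for chunk in split_sentence(text, min_len=10, language_str='EN'):
    print(chunk)

# Round-trip a short message through its (pad_len x 8) bit matrix.
bits = string_to_bits("@MyShell")   # numpy array of shape (8, 8) for an 8-character string
assert bits_to_string(bits) == "@MyShell"
```

The merge_short_sentences_* helpers then fold any remaining one- or two-word chunk into its neighbour, so no chunk is left that short.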
--------------------------------------------------------------------------------
/openvoice_server/Dockerfile:
--------------------------------------------------------------------------------
1 | # Use an official Python runtime as a parent image
2 | FROM continuumio/miniconda3:24.1.2-0
3 | 
4 | # Set the working directory in the container to /app
5 | WORKDIR /app
6 | 
7 | # Add current directory files to /app in the container
8 | ADD . /app
9 | 
10 | # Clone the git repo
11 | RUN git clone https://github.com/ValyrianTech/OpenVoice_server.git
12 | 
13 | # Set the working directory to OpenVoice
14 | WORKDIR /app/OpenVoice_server
15 | 
16 | # Create a new conda environment with python 3.9 and activate it
17 | RUN conda create -n openvoice python=3.9
18 | RUN echo "source activate openvoice" > ~/.bashrc
19 | ENV PATH /opt/conda/envs/openvoice/bin:$PATH
20 | # Force python to use unbuffered mode for logging
21 | ENV PYTHONUNBUFFERED=1
22 | 
23 | # Install MeloTTS, then the OpenVoice package together with uvicorn for FastAPI
24 | # (the unidic download provides the Japanese dictionary used by MeloTTS)
25 | RUN git clone https://github.com/myshell-ai/MeloTTS.git && \
26 |     cd MeloTTS && \
27 |     pip install -e . && \
28 |     python -m unidic download && \
29 |     cd ..
30 | 
31 | RUN pip install -e . uvicorn ffmpeg
32 | RUN pip install -r requirements.txt
33 | 
34 | # Download and extract the checkpoint file
35 | RUN apt-get update && apt-get install -y unzip wget
36 | RUN wget https://myshell-public-repo-hosting.s3.amazonaws.com/openvoice/checkpoints_v2_0417.zip
37 | RUN unzip checkpoints_v2_0417.zip -d ./openvoice
38 | RUN rm checkpoints_v2_0417.zip
39 | RUN mv resources openvoice/resources
40 | 
41 | # Make port 8000 available to the world outside this container
42 | EXPOSE 8000
43 | EXPOSE 7860
44 | 
45 | # Switch the working directory to the openvoice subdirectory
46 | WORKDIR /app/OpenVoice_server/openvoice
47 | 
48 | RUN conda install --yes ffmpeg
49 | RUN conda install --yes libmagic
50 | 
51 | # Copy the startup script into the container
52 | COPY start.sh /app/OpenVoice_server/openvoice/start.sh
53 | 
54 | # Provide permissions to execute the script
55 | RUN chmod +x /app/OpenVoice_server/openvoice/start.sh
56 | 
57 | # Start the server once to initiate the first-time setup that downloads some models.
58 | COPY start_and_stop_server.sh /app/OpenVoice_server/openvoice/start_and_stop_server.sh
59 | RUN chmod +x /app/OpenVoice_server/openvoice/start_and_stop_server.sh
60 | RUN /app/OpenVoice_server/openvoice/start_and_stop_server.sh
61 | 
62 | # Run the startup script, which activates the conda environment and starts the server, when the container launches
63 | CMD ["bash", "/app/OpenVoice_server/openvoice/start.sh"]
64 | 
--------------------------------------------------------------------------------
/openvoice_server/start.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | # Activate conda environment
3 | source activate openvoice
4 | 
5 | # Start the server
6 | uvicorn openvoice_server:app --host "0.0.0.0" --port 8000
7 | 
--------------------------------------------------------------------------------
/openvoice_server/start_and_stop_server.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | # Activate conda environment
3 | source activate openvoice
4 | # Install libmagic
5 | conda install --yes libmagic
6 | # Start the server
7 | uvicorn openvoice_server:app --host "0.0.0.0" --port 8000 &
8 | # Get its PID ($! expands to the PID of the most recent background job)
9 | PID=$!
10 | # Wait for 300 seconds to allow some models to download 11 | sleep 300 12 | # Kill the process 13 | kill $PID 14 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | librosa==0.9.1 2 | faster-whisper==0.9.0 3 | pydub==0.25.1 4 | wavmark==0.0.3 5 | numpy==1.22.0 6 | eng_to_ipa==0.0.2 7 | inflect==7.0.0 8 | unidecode==1.3.7 9 | whisper-timestamped==1.14.2 10 | openai 11 | python-dotenv 12 | pypinyin==0.50.0 13 | cn2an==0.5.22 14 | jieba==0.42.1 15 | gradio==3.48.0 16 | langid==1.1.6 17 | botocore==1.34.88 18 | cached_path==1.6.2 19 | python-magic==0.4.27 20 | libmagic==1.0 21 | fastapi==0.110.1 22 | torch==2.2.2 23 | soundfile==0.12.1 24 | starlette==0.37.2 25 | pydantic==2.6.4 26 | requests==2.31.0 -------------------------------------------------------------------------------- /resources/demo_speaker0.mp3: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ValyrianTech/OpenVoice_server/310f658a0db23852fa16dabc24401f86a4d9f199/resources/demo_speaker0.mp3 -------------------------------------------------------------------------------- /resources/demo_speaker1.mp3: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ValyrianTech/OpenVoice_server/310f658a0db23852fa16dabc24401f86a4d9f199/resources/demo_speaker1.mp3 -------------------------------------------------------------------------------- /resources/demo_speaker2.mp3: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ValyrianTech/OpenVoice_server/310f658a0db23852fa16dabc24401f86a4d9f199/resources/demo_speaker2.mp3 -------------------------------------------------------------------------------- /resources/example_reference.mp3: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ValyrianTech/OpenVoice_server/310f658a0db23852fa16dabc24401f86a4d9f199/resources/example_reference.mp3 -------------------------------------------------------------------------------- /resources/framework-ipa.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ValyrianTech/OpenVoice_server/310f658a0db23852fa16dabc24401f86a4d9f199/resources/framework-ipa.png -------------------------------------------------------------------------------- /resources/huggingface.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ValyrianTech/OpenVoice_server/310f658a0db23852fa16dabc24401f86a4d9f199/resources/huggingface.png -------------------------------------------------------------------------------- /resources/lepton-hd.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ValyrianTech/OpenVoice_server/310f658a0db23852fa16dabc24401f86a4d9f199/resources/lepton-hd.png -------------------------------------------------------------------------------- /resources/myshell-hd.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ValyrianTech/OpenVoice_server/310f658a0db23852fa16dabc24401f86a4d9f199/resources/myshell-hd.png -------------------------------------------------------------------------------- /resources/openvoicelogo.jpg: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/ValyrianTech/OpenVoice_server/310f658a0db23852fa16dabc24401f86a4d9f199/resources/openvoicelogo.jpg -------------------------------------------------------------------------------- /resources/tts-guide.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ValyrianTech/OpenVoice_server/310f658a0db23852fa16dabc24401f86a4d9f199/resources/tts-guide.png -------------------------------------------------------------------------------- /resources/voice-clone-guide.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ValyrianTech/OpenVoice_server/310f658a0db23852fa16dabc24401f86a4d9f199/resources/voice-clone-guide.png -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | 4 | setup(name='MyShell-OpenVoice', 5 | version='0.0.0', 6 | description='Instant voice cloning by MyShell.', 7 | long_description=open('README.md').read().strip(), 8 | long_description_content_type='text/markdown', 9 | keywords=[ 10 | 'text-to-speech', 11 | 'tts', 12 | 'voice-clone', 13 | 'zero-shot-tts' 14 | ], 15 | url='https://github.com/myshell-ai/OpenVoice', 16 | project_urls={ 17 | 'Documentation': 'https://github.com/myshell-ai/OpenVoice/blob/main/docs/USAGE.md', 18 | 'Changes': 'https://github.com/myshell-ai/OpenVoice/releases', 19 | 'Code': 'https://github.com/myshell-ai/OpenVoice', 20 | 'Issue tracker': 'https://github.com/myshell-ai/OpenVoice/issues', 21 | }, 22 | author='MyShell', 23 | author_email='ethan@myshell.ai', 24 | license='MIT License', 25 | packages=find_packages(), 26 | 27 | python_requires='>=3.9', 28 | install_requires=[ 29 | 'librosa==0.9.1', 30 | 'faster-whisper==0.9.0', 31 | 'pydub==0.25.1', 32 | 'wavmark==0.0.3', 33 | 'numpy==1.22.0', 34 | 'eng_to_ipa==0.0.2', 35 | 'inflect==7.0.0', 36 | 'unidecode==1.3.7', 37 | 'whisper-timestamped==1.14.2', 38 | 'pypinyin==0.50.0', 39 | 'cn2an==0.5.22', 40 | 'jieba==0.42.1', 41 | 'gradio==3.48.0', 42 | 'langid==1.1.6' 43 | ], 44 | zip_safe=False 45 | ) 46 | --------------------------------------------------------------------------------
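One closing usage note: the HParams helper in openvoice/utils.py (shown earlier) is the object that get_hparams_from_file returns after parsing a JSON config. The sketch below uses an invented inline config purely for illustration; the keys and values are not taken from the repository's checkpoint configs.

```python
from openvoice.utils import HParams

# Nested dicts become nested HParams instances, so values can be reached
# either as attributes or with dict-style indexing.
config = {
    "data": {"sampling_rate": 22050, "filter_length": 1024},
    "model": {"hidden_channels": 192},
}
hps = HParams(**config)

print(hps.data.sampling_rate)           # attribute access      -> 22050
print(hps["model"]["hidden_channels"])  # item access           -> 192
print("data" in hps, len(hps))          # __contains__/__len__  -> True 2
```

A real config would normally be loaded with get_hparams_from_file(<path to a config.json>), which wraps exactly this constructor.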