├── .gitignore
├── LICENSE
├── MANIFEST.in
├── README.md
├── requirements.txt
├── setup.py
└── tts_server
    ├── __init__.py
    ├── cli.py
    ├── server.py
    └── voices
        ├── en-f-1.wav
        ├── en-f-2.wav
        ├── en-f-3.wav
        ├── en-m-1.wav
        ├── en-m-2.wav
        └── en-m-3.wav

/.gitignore:
--------------------------------------------------------------------------------
.conda-env/
sessions/
build/
*.egg-info
__pycache__

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright 2023 Aleksey Smolenchuk

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
recursive-include tts_server *

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# 🎤 TTS-Server

![](https://img.shields.io/badge/no-bugs-brightgreen.svg) ![](https://img.shields.io/badge/coverage-%F0%9F%92%AF-green.svg)

This project is a Text-to-Speech (TTS) server implemented in Python using Flask. It uses [StyleTTS2](https://github.com/yl4579/StyleTTS2) for high-quality speech synthesis. The server generates speech from text with configurable voice styles and parameters, and manages a queue for processing TTS requests.

## Features

- Streaming generation allows for long-form narration
- Upload your own voice style files
- Voice cache improves generation performance
- No direct dependency on espeak (but it's required to be installed on your system -- see prerequisites)
- CLI client

## Prerequisites

You need torch and [phonemizer](https://pypi.org/project/phonemizer/) installed in your environment, and espeak installed on your system:

```bash
pip install phonemizer
```

## Starting The Server

Install the server:

```bash
pip install git+https://github.com/lxe/tts-server.git
```

Run the server:

```bash
python -m tts_server.server
```

You can use `--host` and `--port` to specify the server's address and port. Run with `--help` for more options.


## HTTP API

### Create New Session
- **Method**: POST
- **URL**: `/session/new`
- **Request** (Form-data or JSON):
  - `voice` (file, optional): Upload a voice file.
  - `voice` (string, optional): Predefined voice name.
- **Response** (JSON):
  - `message`: Confirmation message.
  - `voice`: Voice used in the session.
  - `session_id`: Generated session ID.

### Perform TTS
- **Method**: POST
- **URL**: `/tts`
- **Headers**: `Accept: audio/wav` (required; any other value returns 406).
- **Request** (JSON):
  - `sessionId` (integer, required): Session identifier.
  - `text` (string, required): Text to be synthesized.
  - `voice` (string, optional): Predefined voice name, used when the session has no cached voice. Defaults to `en-f-1`.
  - `seed` (integer, optional): Random seed. Defaults to the session ID; `-1` selects a random seed.
  - `alpha`, `beta`, `diffusion_steps`, `embedding_scale` (floats/integers, optional): TTS parameters.
- **Response** (audio/wav):
  - Binary WAV file.

### Get Available Voices
- **Method**: GET
- **URL**: `/voices`
- **Response** (JSON):
  - List of available voice keys.

### Error Handling
- Errors are returned as JSON objects with an `error` field and a matching HTTP status code (for example, 400 for missing text and 406 for an unsupported `Accept` header).

## CLI Client

You can interact with the server using curl:

```bash
curl -X POST -H "Content-Type: application/json" -H "Accept: audio/wav" -d '{
  "sessionId": 12345,
  "text": "Embrace the chaos and let your words dance to the rhythm of imagination!",
  "alpha": 0.2,
  "beta": 0.4,
  "diffusion_steps": 10,
  "embedding_scale": 1.5
}' "http://localhost:5050/tts" | aplay
```

...or the CLI client...

```bash
python -m tts_server.cli "In a fantastical forest, flittering fireflies illuminate the night, casting a mesmerizing dance of light and shadow beneath the ancient, gnarled trees."
```

You can pipe longer text to the CLI for long-form narration:

```bash
echo "As the fireflies twinkle in harmonious rhythm, their gentle glow reveals the secrets of the woodland. Tiny creatures, hidden from sight by day, emerge to partake in this nocturnal spectacle. Frogs serenade with their melodic croaks, and owls, wise sentinels of the night, exchange hoots that echo through the enchanted forest." | python -m tts_server.cli -
```

## Credits

This project uses [StyleTTS2](https://github.com/yl4579/StyleTTS2) for its text-to-speech synthesis capabilities. Special thanks to @yl4579 for creating StyleTTS2, and to @fakerybakery, whose [styletts2-importable](https://github.com/fakerybakery/StyleTTS2/blob/main/styletts2/inference.py) was the inspiration for `tts.py`.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
flask
soundfile
git+https://github.com/lxe/styletts2.git

--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
from setuptools import setup, find_packages

setup(
    name="tts_server",
    version="1.0.0",
    packages=find_packages(),
    include_package_data=True,
    install_requires=[
        "flask",
        "soundfile",  # imported by tts_server/server.py; also listed in requirements.txt
        "styletts2 @ git+https://github.com/lxe/styletts2.git",
    ],
    # Note: tts_server/cli.py additionally imports requests, sounddevice, numpy, and nltk.
    entry_points={
        "console_scripts": [
            "tts-server = tts_server.server:main",
            "tts-server-cli = tts_server.cli:main",
        ],
    },
)

--------------------------------------------------------------------------------
/tts_server/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxe/tts-server/a46e4fbab0fd02e1f427387fa7e02f6413ded8d4/tts_server/__init__.py

--------------------------------------------------------------------------------
/tts_server/cli.py:
--------------------------------------------------------------------------------
import requests
import argparse
import sounddevice as sd
import numpy as np
import time
import sys
import threading
import queue
import signal
import nltk.data
import random

# Requires the NLTK "punkt" tokenizer data to be available locally
sentence_detector = nltk.data.load('tokenizers/punkt/english.pickle')

url = 'http://localhost:5050'
session_id = random.randint(0, 4294967295)
audio_queue = queue.Queue()

def play_audio():
    # Play queued clips one at a time; a None entry ends the thread
    while True:
        wav = audio_queue.get()
        if wav is None:
            break
        sd.play(wav, 24000)
        sd.wait()

def signal_handler(signum, frame):
    # On Ctrl+C: drop pending audio, stop the playback thread, and exit
    audio_queue.queue.clear()
    audio_queue.put(None)
    sd.stop()
    sys.exit(0)

signal.signal(signal.SIGINT, signal_handler)

def play_sentence(sentence, voice, alpha, beta, diffusion_steps, embedding_scale, seed):
    print(f'Generating "{sentence}"', end='')

    start = time.time()
    data = {
        "sessionId": session_id,
        "text": sentence,
        "voice": voice,
        "alpha": alpha,
        "beta": beta,
        "diffusion_steps": diffusion_steps,
        "embedding_scale": embedding_scale,
        "seed": seed
    }

    try:
        response = requests.post(f'{url}/tts', headers={'Accept': 'audio/wav'}, json=data)
        # Fail on HTTP errors so a JSON error body is never played as audio
        response.raise_for_status()
    except requests.RequestException as e:
        print(e)
        return

    finish = time.time()
    print(f" {finish - start} seconds")

    wav = np.frombuffer(response.content, dtype=np.int16)

    wav = wav[50:] # there's a click in the beginning, so remove it
    audio_queue.put(wav)


def main():
    global url
    global session_id

    parser = argparse.ArgumentParser(description='Send TTS requests to a server and play the resulting audio.')
    parser.add_argument('passage', help='Passage to convert to speech, or "-" to read from stdin')
    parser.add_argument('--url', help='Server base URL')
    parser.add_argument('--sessionId', type=int, help='Session identifier')
    parser.add_argument('--voice', default='en-f-1', help='Reference voice / style')
    parser.add_argument('--alpha', type=float, default=0.3, help='Alpha parameter')
    parser.add_argument('--beta', type=float, default=0.7, help='Beta parameter')
    parser.add_argument('--diffusion_steps', type=int, default=25, help='Number of diffusion steps')
    parser.add_argument('--embedding_scale', type=float, default=2, help='Embedding scale')
    parser.add_argument('--seed', type=int, default=None, help='Random seed')

    args = parser.parse_args()

    # Queue audio to be played in a separate thread.
    thread = threading.Thread(target=play_audio)
    thread.start()

    if args.sessionId is not None:
        session_id = args.sessionId

    if args.url is not None:
        url = args.url

    # Default the seed to the session ID
    if args.seed is None:
        args.seed = session_id

    if args.passage == '-':
        args.passage = sys.stdin.read()

    sentences = sentence_detector.tokenize(args.passage.strip())

    for sentence in sentences:
        # Sentences longer than 250 characters are run through the tokenizer once more
        parts = sentence_detector.tokenize(sentence) if len(sentence) > 250 else [sentence]
        for part in parts:
            play_sentence(
                part,
                args.voice,
                args.alpha,
                args.beta,
                args.diffusion_steps,
                args.embedding_scale,
                args.seed
            )

    # Tell the playback thread to exit once the queue drains
    audio_queue.put(None)

if __name__ == "__main__":
    main()

--------------------------------------------------------------------------------
/tts_server/server.py:
--------------------------------------------------------------------------------
import queue
import threading
from flask import Flask, request, jsonify, send_file
import os
import io
import soundfile as sf
import argparse
from styletts2 import TTS
import subprocess
import traceback
import random

app = Flask(__name__)
fdir = os.path.dirname(__file__)

# Cache of computed voice styles, keyed by session ID or voice name
voices = {}
# Previous style state per session, passed back into tts.inference
prev_s_db = {}

tts = None


class HotProcessQueue:
    """Keeps a pre-spawned ("hot") process ready so callers skip start-up latency."""

    def __init__(self, *popen_args, **popen_kwargs):
        self.popen_args = popen_args
        self.popen_kwargs = popen_kwargs
        self.process_queue = queue.Queue()
        self.lock = threading.Lock()
        # Initial process spawning
        self.spawn_new_process()

    def spawn_new_process(self):
        # Spawn a new process and add it to the queue
        process = subprocess.Popen(
            *self.popen_args,
            **self.popen_kwargs,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE
        )
        self.process_queue.put(process)

    def consume(self, input_string):
        # Wait for an available process
        process = self.process_queue.get()

        # Communicate with the process
        stdout, stderr = process.communicate(input=input_string.encode())
        process.terminate()

        # Spawn a new process asynchronously
        threading.Thread(target=self.spawn_new_process).start()

        return stdout.decode()

    def __del__(self):
        # Cleanup: close all processes
        while not self.process_queue.empty():
            process = self.process_queue.get()
            process.terminate()


class Phonemizer:
    def __init__(self, language='en-us'):
        self.language = language
        self.queue = HotProcessQueue([
            'phonemize',
            '--preserve-punctuation', '--with-stress', '--language', self.language
        ])

    def phonemize(self, texts):
        # Only the first text is phonemized; the result is returned as a one-item list
        try:
            passage = texts[0]
            output = self.queue.consume(passage)
            return [output]
        except Exception as e:
            print(e)
            return None


phonemizer = Phonemizer()


class ServerException(Exception):
    def __init__(self, message, status_code):
        super().__init__(message)
        self.status_code = status_code
        self.message = message


@app.errorhandler(ServerException)
def handle_server_exception(error):
    traceback.print_exc()
    response = jsonify({"error": error.message})
    response.status_code = error.status_code
    return response


@app.errorhandler(404)
def not_found(error):
    return jsonify({'error': 'Not found'}), 404


def compute_voice(voice_file_path):
    print(f'Computing style for {voice_file_path}')
    voice = tts.compute_style(voice_file_path)
    return voice


@app.route('/voices', methods=['GET'])
def get_voices():
    return jsonify(list(voices.keys()))


@app.route('/session/new', methods=['POST'])
def new_session():
    try:
        # random session id between 0 and 2^32 - 1
        session_id = random.randint(0, 4294967295)

        if 'voice' in request.files and request.files['voice'].filename != '':
            voice_file = request.files['voice']
            voice_name = os.path.splitext(voice_file.filename)[0]

            # Read file into memory
            voice_data = io.BytesIO()
            voice_file.save(voice_data)
            voice_data.seek(0)

            # Convert style to wav using FFmpeg, reading from and writing to memory
            process = subprocess.Popen(
                ['ffmpeg', '-i', '-', '-ac', '1', '-ar', '24000', '-f', 'wav', '-'],
                stdin=subprocess.PIPE,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE)
            stdout, stderr = process.communicate(input=voice_data.read())

            # Check for errors (stderr is bytes, so decode it before serializing to JSON)
            if process.returncode != 0:
                return jsonify({"error": "FFmpeg error", "cause": stderr.decode(errors='replace')}), 500

            # Convert stdout bytes to a BytesIO object
            style_wav_data = io.BytesIO(stdout)

            # Compute style and store in voices dictionary
            voices[session_id] = compute_voice(style_wav_data)
        else:
            # Accept the voice name from form data or a JSON body; default to en-f-1
            json_body = request.get_json(silent=True) or {}
            voice_name = request.form.get('voice') or json_body.get('voice') or 'en-f-1'
            voice_file_path = os.path.join(fdir, 'voices', f'{voice_name}.wav')
            voice = compute_voice(voice_file_path)
            voices[session_id] = voice
            voices[voice_name] = voice

        return jsonify({
            "message": "New session created",
            "voice": voice_name,
            "session_id": session_id
        })

    except Exception:
        traceback.print_exc()
        return jsonify({"error": "Failed to create new session"}), 500


@app.route('/tts', methods=['POST'])
def text_to_speech():
    try:
        # Accept header validation
        if request.headers.get('Accept') != 'audio/wav':
            raise ServerException("Only audio/wav is supported", 406)

        data = request.json
        session_id = int(data.get('sessionId', -1))  # -1 if not found

        # Session ID validation
        if session_id < 0 or session_id > 4294967295:
            raise ServerException(f"Invalid session ID: {session_id}", 400)

        # Compute or retrieve style
        if session_id not in voices:
            voice_name = data.get('voice', 'en-f-1')
            if voice_name in voices:
                voice = voices[voice_name]
            else:
                voice_file_path = os.path.join(
                    fdir, 'voices', f'{voice_name}.wav')
                voice = compute_voice(voice_file_path)
                voices[session_id] = voice
                voices[voice_name] = voice
        else:
            voice = voices[session_id]

        text = data.get('text')
        if not text:
            raise ServerException("Text is required", 400)

        # The seed defaults to the session ID; -1 requests a random seed
        seed = data.get('seed', session_id)
        if seed == -1:
            seed = random.randint(0, 4294967295)
        tts.set_seed(int(seed))

        # Generate audio
        wav, prev_s = tts.inference(
            text,
            voice,
            phonemizer=phonemizer,
            prev_s=prev_s_db.get(session_id, None),
            alpha=data.get('alpha', 0.3),
            beta=data.get('beta', 0.7),
            diffusion_steps=data.get('diffusion_steps', 20),
            embedding_scale=data.get('embedding_scale', 1.5)
        )

        # Update previous voice state
        prev_s_db[session_id] = prev_s

        byte_io = io.BytesIO()
        sf.write(byte_io, wav, 24000, format='WAV')
        byte_io.seek(0)
        return send_file(byte_io, mimetype='audio/wav')

    except ServerException:
        raise

    except Exception as e:
        raise ServerException(str(e), 500) from e


def main():
    global tts
    parser = argparse.ArgumentParser(
        description='Run the Text-to-Speech Server')

    parser.add_argument('--debug', action='store_true',
                        help='Enable debug mode')
    parser.add_argument('--host', default='127.0.0.1',
                        help='Host IP address (default: 127.0.0.1)')
    parser.add_argument('--port', type=int, default=5050,
                        help='Port number (default: 5050)')

    args = parser.parse_args()

    tts = TTS.load_model(
        config_path="hf://yl4579/StyleTTS2-LibriTTS/Models/LibriTTS/config.yml",
        checkpoint_path="hf://yl4579/StyleTTS2-LibriTTS/Models/LibriTTS/epochs_2nd_00020.pth"
    )

    app.run(debug=args.debug, host=args.host, port=args.port)


if __name__ == '__main__':
    main()

--------------------------------------------------------------------------------
/tts_server/voices/en-f-1.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxe/tts-server/a46e4fbab0fd02e1f427387fa7e02f6413ded8d4/tts_server/voices/en-f-1.wav

--------------------------------------------------------------------------------
/tts_server/voices/en-f-2.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxe/tts-server/a46e4fbab0fd02e1f427387fa7e02f6413ded8d4/tts_server/voices/en-f-2.wav

--------------------------------------------------------------------------------
/tts_server/voices/en-f-3.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxe/tts-server/a46e4fbab0fd02e1f427387fa7e02f6413ded8d4/tts_server/voices/en-f-3.wav

--------------------------------------------------------------------------------
/tts_server/voices/en-m-1.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxe/tts-server/a46e4fbab0fd02e1f427387fa7e02f6413ded8d4/tts_server/voices/en-m-1.wav

--------------------------------------------------------------------------------
/tts_server/voices/en-m-2.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxe/tts-server/a46e4fbab0fd02e1f427387fa7e02f6413ded8d4/tts_server/voices/en-m-2.wav

--------------------------------------------------------------------------------
/tts_server/voices/en-m-3.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxe/tts-server/a46e4fbab0fd02e1f427387fa7e02f6413ded8d4/tts_server/voices/en-m-3.wav

--------------------------------------------------------------------------------
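
The `/tts` endpoint described in the README can also be called from Python without the bundled CLI. The following is a minimal sketch using only the standard library, assuming a server at the README's default address `http://localhost:5050`. The helpers `build_tts_request` and `synthesize` are hypothetical names introduced for illustration and are not part of `tts_server`; the range check mirrors the session ID validation in `server.py`.

```python
import json
import urllib.request

# server.py rejects session IDs outside 0..4294967295
MAX_SESSION_ID = 4294967295


def build_tts_request(base_url, session_id, text, **params):
    """Build (but do not send) a POST /tts request.

    Hypothetical helper for illustration; extra keyword arguments such as
    alpha, beta, diffusion_steps, embedding_scale, voice, or seed are copied
    into the JSON body unchanged.
    """
    if not 0 <= session_id <= MAX_SESSION_ID:
        raise ValueError(f"session_id out of range: {session_id}")
    if not text:
        raise ValueError("text is required")
    payload = {"sessionId": session_id, "text": text, **params}
    return urllib.request.Request(
        f"{base_url}/tts",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "audio/wav",  # the server answers 406 for any other value
        },
        method="POST",
    )


def synthesize(base_url, session_id, text, **params):
    """Send the request and return the WAV bytes from the response body."""
    request = build_tts_request(base_url, session_id, text, **params)
    with urllib.request.urlopen(request) as response:
        return response.read()
```

With a server running, `synthesize("http://localhost:5050", 12345, "Hello there.", alpha=0.2, beta=0.4)` should return the bytes of a WAV file that can be written to disk or piped to a player. Building the request separately from sending it keeps the payload easy to inspect without a live server.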