├── .gitignore
├── LICENSE
├── MANIFEST.in
├── README.md
├── requirements.txt
├── setup.py
└── tts_server
    ├── __init__.py
    ├── cli.py
    ├── server.py
    └── voices
        ├── en-f-1.wav
        ├── en-f-2.wav
        ├── en-f-3.wav
        ├── en-m-1.wav
        ├── en-m-2.wav
        └── en-m-3.wav


/.gitignore:
--------------------------------------------------------------------------------
1 | .conda-env/
2 | sessions/
3 | build/
4 | *.egg-info
5 | __pycache__
6 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright 2023 Aleksey Smolenchuk
2 | 
3 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
4 | 
5 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
6 | 
7 | THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
8 | 


--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
1 | recursive-include tts_server *
2 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # 🎤 TTS-Server
  2 | 
  3 | ![](https://img.shields.io/badge/no-bugs-brightgreen.svg) ![](https://img.shields.io/badge/coverage-%F0%9F%92%AF-green.svg)
  4 | 
  5 | This project is a Text-to-Speech (TTS) server implemented in Python using Flask. It utilizes [StyleTTS2](https://github.com/yl4579/StyleTTS2) for high-quality speech synthesis. The server facilitates the generation of speech from text with various voice styles and parameters, and manages a queue for processing TTS requests.
  6 | 
  7 | ## Features
  8 | 
  9 | - Streaming generation allows for long-form narration
 10 | - Upload your own voice style files
 11 | - Voice cache improves generation performance
 12 | - No direct Python dependency on espeak: phonemization runs through the `phonemize` command-line tool in a subprocess (espeak must still be installed on your system -- see prerequisites)
 13 | - CLI client
 14 | 
 15 | ## Prerequisites
 16 | 
 17 | You need torch and [phonemizer](https://pypi.org/project/phonemizer/) installed on your system / environment (uploading your own voice files additionally requires `ffmpeg` on your `PATH`):
 18 | 
 19 | ```bash
 20 | pip install phonemizer
 21 | ```
 22 | 
 23 | ## Starting The Server
 24 | 
 25 | Install the server:
 26 | 
 27 | ```bash
 28 | pip install git+https://github.com/lxe/tts-server.git
 29 | ```
 30 | 
 31 | Run the server:
 32 | 
 33 | ```bash
 34 | python -m tts_server.server
 35 | ```
 36 | 
 37 | Use `--host <host>` and `--port <port>` to set the server's address and port (defaults: `127.0.0.1` and `5050`). Run with `--help` to see all options.
 38 | 
 39 | 
 40 | ## HTTP API
 41 | 
 42 | ### Create New Session
 43 | - **Method**: POST
 44 | - **URL**: `/session/new`
 45 | - **Request** (Form-data or JSON):
 46 |   - `voice` (file, optional): Upload a voice file.
 47 |   - `voice` (string, optional): Predefined voice name.
 48 | - **Response** (JSON):
 49 |   - `message`: Confirmation message.
 50 |   - `voice`: Voice used in the session.
 51 |   - `session_id`: Generated session ID.
 52 | 
 53 | ### Perform TTS
 54 | - **Method**: POST
 55 | - **URL**: `/tts`
 56 | - **Request** (JSON):
 57 |   - `sessionId` (integer, required): Session identifier.
 58 |   - `text` (string, required): Text to be synthesized.
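 59 |   - `voice` (string, optional; default `en-f-1`), `seed` (integer, optional; defaults to the session ID), `alpha`, `beta`, `diffusion_steps`, `embedding_scale` (floats/integers, optional): Voice name, random seed, and TTS parameters.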
 60 | - **Response** (audio/wav):
 61 |   - Binary WAV file.
 62 | 
 63 | ### Get Available Voices
 64 | - **Method**: GET
 65 | - **URL**: `/voices`
 66 | - **Response** (JSON):
 67 |   - List of available voice keys.
 68 | 
 69 | ### Error Handling
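 70 | - Errors are returned as JSON of the form `{"error": "<message>"}` with a matching HTTP status code: `400` for a missing `text` or an invalid session ID, `406` when the `Accept` header is not `audio/wav`, `404` for unknown routes, and `500` for other server failures.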
 71 | 
 72 | ## CLI Client
 73 | 
 74 | You can interact with the server using `curl`:
 75 | 
 76 | ```bash
 77 | curl -X POST -H "Content-Type: application/json" -H "Accept: audio/wav" -d '{
 78 |   "sessionId": 12345,
 79 |   "text": "Embrace the chaos and let your words dance to the rhythm of imagination!",
 80 |   "alpha": 0.2,
 81 |   "beta": 0.4,
 82 |   "diffusion_steps": 10,
 83 |   "embedding_scale": 1.5
 84 | }' "http://localhost:5050/tts" | aplay
 85 | ```
 86 | 
 87 | ...or the CLI client...
 88 | 
 89 | ```bash
 90 | python -m tts_server.cli "In a fantastical forest, flittering fireflies illuminate the night, casting a mesmerizing dance of light and shadow beneath the ancient, gnarled trees."
 91 | ```
 92 | 
 93 | For long-form narration, pass `-` as the passage to read longer text from stdin:
 94 | 
 95 | ```bash
 96 | echo "As the fireflies twinkle in harmonious rhythm, their gentle glow reveals the secrets of the woodland. Tiny creatures, hidden from sight by day, emerge to partake in this nocturnal spectacle. Frogs serenade with their melodic croaks, and owls, wise sentinels of the night, exchange hoots that echo through the enchanted forest." | python -m tts_server.cli -
 97 | ```
 98 | 
 99 | ## Credits
100 | 
101 | This project uses [StyleTTS2](https://github.com/yl4579/StyleTTS2) for its text-to-speech synthesis capabilities. Special thanks to @yl4579 for creating StyleTTS2, and to @fakerybakery, whose [styletts2-importable](https://github.com/fakerybakery/StyleTTS2/blob/main/styletts2/inference.py) inspired `tts.py`.
102 | 
103 | ## License
104 | 
105 | This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
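
The `/tts` endpoint described in the HTTP API section can also be called from Python. The following is a minimal sketch, not part of this repository: the helper names `build_tts_request`, `decode_wav`, and `synthesize` are hypothetical, and it assumes the server listens on `http://localhost:5050` and returns a standard PCM WAV body. It uses only the standard library and parses the response with the `wave` module, so the WAV header is never treated as audio samples.

```python
import io
import json
import urllib.request
import wave


def build_tts_request(base_url, session_id, text, **params):
    """Build (but do not send) a POST request for the /tts endpoint."""
    if not text:
        raise ValueError("text is required")
    if not 0 <= int(session_id) <= 4294967295:
        raise ValueError("sessionId must be between 0 and 2**32 - 1")
    # Optional parameters (alpha, beta, diffusion_steps, ...) pass through as-is.
    payload = {"sessionId": int(session_id), "text": text, **params}
    return urllib.request.Request(
        f"{base_url}/tts",
        data=json.dumps(payload).encode("utf-8"),
        # The server rejects requests whose Accept header is not audio/wav.
        headers={"Content-Type": "application/json", "Accept": "audio/wav"},
        method="POST",
    )


def decode_wav(wav_bytes):
    """Return (sample_rate, channels, sample_width, raw_frames) from WAV bytes."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        frames = wav_file.readframes(wav_file.getnframes())
        return (
            wav_file.getframerate(),
            wav_file.getnchannels(),
            wav_file.getsampwidth(),
            frames,
        )


def synthesize(base_url, session_id, text, **params):
    """Send the request to a running server and decode the WAV response."""
    request = build_tts_request(base_url, session_id, text, **params)
    with urllib.request.urlopen(request) as response:
        return decode_wav(response.read())
```

With a server running, `synthesize("http://localhost:5050", 12345, "Hello there.", alpha=0.2, beta=0.4)` would return the sample rate, channel count, sample width in bytes, and the raw audio frames.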


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | flask
2 | soundfile
3 | git+https://github.com/lxe/styletts2.git
4 | # imported by tts_server/cli.py
5 | requests
6 | sounddevice
7 | numpy
8 | nltk
9 | 


--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
 1 | from setuptools import setup, find_packages
 2 | 
 3 | setup(
 4 |     name="tts_server",
 5 |     version="1.0.0",
 6 |     packages=find_packages(),
 7 |     include_package_data=True,
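 8 |     install_requires=[
 9 |         "flask",
10 |         "soundfile",  # imported by tts_server/server.py
11 |         "styletts2 @ git+https://github.com/lxe/styletts2.git",
12 |         # The packages below are imported by tts_server/cli.py
13 |         "requests",
14 |         "sounddevice",
15 |         "numpy",
16 |         "nltk",
17 |     ],
18 |     entry_points={
19 |         "console_scripts": [
20 |             "tts-server = tts_server.server:main",
21 |             "tts-server-cli = tts_server.cli:main",
22 |         ],
23 |     },
24 | )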


--------------------------------------------------------------------------------
/tts_server/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxe/tts-server/a46e4fbab0fd02e1f427387fa7e02f6413ded8d4/tts_server/__init__.py


--------------------------------------------------------------------------------
/tts_server/cli.py:
--------------------------------------------------------------------------------
  1 | import requests
  2 | import argparse
  3 | import sounddevice as sd
  4 | import numpy as np
  5 | import time
  6 | import sys
  7 | import threading
  8 | import queue
  9 | import signal
 10 | import nltk.data
 11 | import random
 12 | 
 13 | 
 14 | sentence_detector = nltk.data.load('tokenizers/punkt/english.pickle')
 15 | 
 16 | url = 'http://localhost:5050'
 17 | session_id = random.randint(0, 4294967295)
 18 | audio_queue = queue.Queue()
 19 | 
 20 | def play_audio():
 21 |     while True:
 22 |         wav = audio_queue.get()
 23 |         if wav is None:
 24 |             break
 25 |         sd.play(wav, 24000)
 26 |         sd.wait()
 27 | 
 28 | def signal_handler(signal, frame):
 29 |     audio_queue.queue.clear()
 30 |     audio_queue.put(None)
 31 |     sd.stop()
 32 |     sys.exit(0)
 33 | 
 34 | signal.signal(signal.SIGINT, signal_handler)
 35 | 
 36 | def play_sentence(sentence, voice, alpha, beta, diffusion_steps, embedding_scale, seed):
 37 |     _sentence = sentence
 38 |     # if len(sentence) > 50:
 39 |     #     _sentence = sentence[:50] + '...'
 40 |     print(f'Generating "{_sentence}"', end='')
 41 | 
 42 |     start = time.time()
 43 |     data = {
 44 |         "sessionId": session_id,
 45 |         "text": sentence,
 46 |         "voice": voice,
 47 |         "alpha": alpha,
 48 |         "beta": beta,
 49 |         "diffusion_steps": diffusion_steps,
 50 |         "embedding_scale": embedding_scale,
 51 |         "seed": seed
 52 |     }
 53 | 
 54 |     try:
 55 |         response = requests.post(f'{url}/tts', headers={'Accept': 'audio/wav'}, json=data)
 56 |         response.raise_for_status()  # a JSON error body must not be queued as audio
 57 |     except Exception as e:
 58 |         print(e)
 59 |         return
 60 |     finish = time.time()
 61 |     print(f" {finish - start} seconds")
 62 | 
 63 |     wav = np.frombuffer(response.content, dtype=np.int16)
 64 | 
 65 |     wav = wav[50:]  # drop the first 50 samples: the raw bytes still include the WAV header, which is likely the source of the click at the start
 66 |     audio_queue.put(wav)
 67 | 
 68 | 
 69 | def main():
 70 |     global url
 71 |     global session_id
 72 | 
 73 |     parser = argparse.ArgumentParser(description='Send TTS requests to a server and play the resulting audio.')
 74 |     parser.add_argument('passage', help='Passage to convert to speech')
 75 |     parser.add_argument('--url', help='Server base URL')
 76 |     parser.add_argument('--sessionId', type=int, help='Session identifier (integer between 0 and 4294967295)')
 77 |     parser.add_argument('--voice', default='en-f-1', help='Reference voice / style')
 78 |     parser.add_argument('--alpha', type=float, default=0.3, help='Alpha parameter')
 79 |     parser.add_argument('--beta', type=float, default=0.7, help='Beta parameter')
 80 |     parser.add_argument('--diffusion_steps', type=int, default=25, help='Number of diffusion steps')
 81 |     parser.add_argument('--embedding_scale', type=float, default=2, help='Embedding scale')
 82 |     parser.add_argument('--seed', type=int, default=None, help='Random seed')
 83 | 
 84 |     args = parser.parse_args()
 85 | 
 86 |     # Queue audio to be played in a separate thread.
 87 |     thread = threading.Thread(target=play_audio)
 88 |     thread.start()
 89 | 
 90 |     if args.sessionId is not None:
 91 |         session_id = args.sessionId
 92 | 
 93 |     if args.url is not None:
 94 |         url = args.url
 95 |         
 96 |     if (args.seed is None):
 97 |         args.seed = session_id  
 98 | 
 99 |     if args.passage == '-':
100 |         args.passage = sys.stdin.read()
101 | 
102 |     sentences = sentence_detector.tokenize(args.passage.strip())
103 | 
104 |     for sentence in sentences:
105 |         # NOTE: the intent is to split overly long sentences by punctuation, but re-running the same sentence tokenizer on a single sentence usually returns it unchanged
106 |         if len(sentence) > 250:
107 |             split_sentences = sentence_detector.tokenize(sentence)
108 |             for split_sentence in split_sentences:
109 |                 play_sentence(
110 |                     split_sentence,
111 |                     args.voice,
112 |                     args.alpha,
113 |                     args.beta,
114 |                     args.diffusion_steps,
115 |                     args.embedding_scale,
116 |                     args.seed
117 |                 )
118 |         else:
119 |             play_sentence(
120 |                 sentence,
121 |                 args.voice,
122 |                 args.alpha,
123 |                 args.beta,
124 |                 args.diffusion_steps,
125 |                 args.embedding_scale,
126 |                 args.seed
127 |             )
128 | 
129 |     audio_queue.put(None)
130 | 
131 | if __name__ == "__main__":
132 |     main()
133 | 
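
The comment in `main()` describes splitting overly long sentences by punctuation while preserving the punctuation. One way that could be done is sketched below. This is an illustration under stated assumptions, not the author's implementation: the helper name `split_long_sentence` is hypothetical, the 250-character limit is taken from `main()`, and clause boundaries are assumed to be commas, semicolons, and colons followed by whitespace.

```python
import re


def split_long_sentence(sentence, max_len=250):
    """Split a long sentence into chunks, breaking only after , ; or :.

    Punctuation stays attached to the clause it ends. A single clause that is
    longer than max_len is kept whole rather than cut mid-word.
    """
    if len(sentence) <= max_len:
        return [sentence]

    # The lookbehind keeps the punctuation; the whitespace after it is consumed.
    clauses = re.split(r'(?<=[,;:])\s+', sentence)

    chunks = []
    current = ''
    for clause in clauses:
        candidate = f'{current} {clause}'.strip()
        if current and len(candidate) > max_len:
            # Adding this clause would exceed the limit: close the current chunk.
            chunks.append(current)
            current = clause
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk could then be passed to `play_sentence` in place of the original sentence. Greedy packing keeps chunks as long as the limit allows, which keeps the number of requests low.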


--------------------------------------------------------------------------------
/tts_server/server.py:
--------------------------------------------------------------------------------
  1 | import queue
  2 | import threading
  3 | from flask import Flask, request, jsonify, send_file
  4 | import os
  5 | import io
  6 | import soundfile as sf
  7 | import argparse
  8 | from styletts2 import TTS
  9 | import subprocess
 10 | import traceback
 11 | import random
 12 | 
 13 | app = Flask(__name__)
 14 | fdir = os.path.dirname(__file__)
 15 | 
 16 | # Cache for voices
 17 | voices = {}
 18 | prev_s_db = {}
 19 | 
 20 | tts = None
 21 | 
 22 | 
 23 | class HotProcessQueue:
 24 |     def __init__(self, *popen_args, **popen_kwargs):
 25 |         self.popen_args = popen_args
 26 |         self.popen_kwargs = popen_kwargs
 27 |         self.process_queue = queue.Queue()
 28 |         self.lock = threading.Lock()
 29 |         # Initial process spawning
 30 |         self.spawn_new_process()
 31 | 
 32 |     def spawn_new_process(self):
 33 |         # Spawn a new process and add it to the queue
 34 |         process = subprocess.Popen(
 35 |             *self.popen_args,
 36 |             **self.popen_kwargs,
 37 |             stdin=subprocess.PIPE,
 38 |             stdout=subprocess.PIPE,
 39 |             stderr=subprocess.PIPE
 40 |         )
 41 |         self.process_queue.put(process)
 42 | 
 43 |     def consume(self, input_string):
 44 |         # Wait for an available process
 45 |         process = self.process_queue.get()
 46 | 
 47 |         # Communicate with the process
 48 |         stdout, stderr = process.communicate(input=input_string.encode())
 49 |         process.terminate()
 50 | 
 51 |         # Spawn a new process asynchronously
 52 |         threading.Thread(target=self.spawn_new_process).start()
 53 | 
 54 |         return stdout.decode()
 55 | 
 56 |     def __del__(self):
 57 |         # Cleanup: close all processes
 58 |         while not self.process_queue.empty():
 59 |             process = self.process_queue.get()
 60 |             process.terminate()
 61 | 
 62 | 
 63 | class Phonemizer:
 64 |     def __init__(self, language='en-us'):
 65 |         self.language = language
 66 |         self.queue = HotProcessQueue([
 67 |             'phonemize',
 68 |             '--preserve-punctuation', '--with-stress', '--language', self.language
 69 |         ])
 70 | 
 71 |     def phonemize(self, texts):
 72 |         try:
 73 |             passage = texts[0]
 74 |             output = self.queue.consume(passage)
 75 |             return [output]
 76 |         except Exception as e:
 77 |             print(e)
 78 |             return None
 79 | 
 80 | 
 81 | phonemizer = Phonemizer()
 82 | 
 83 | 
 84 | class ServerException(Exception):
 85 |     def __init__(self, message, status_code):
 86 |         super().__init__(message)
 87 |         self.status_code = status_code
 88 |         self.message = message
 89 | 
 90 | 
 91 | @app.errorhandler(ServerException)
 92 | def handle_server_exception(error):
 93 |     traceback.print_exc()
 94 |     response = jsonify({"error": error.message})
 95 |     response.status_code = error.status_code
 96 |     return response
 97 | 
 98 | 
 99 | @app.errorhandler(404)
100 | def not_found(error):
101 |     return jsonify({'error': 'Not found'}), 404
102 | 
103 | 
104 | def compute_voice(voice_file_path):
105 |     print(f'Computing style for {voice_file_path}')
106 |     voice = tts.compute_style(voice_file_path)
107 |     return voice
108 | 
109 | 
110 | @app.route('/voices', methods=['GET'])
111 | def get_voices():
112 |     return jsonify(list(voices.keys()))
113 | 
114 | 
115 | @app.route('/session/new', methods=['POST'])
116 | def new_session():
117 |     try:
118 |         # random session id between 0 and 2^32
119 |         session_id = random.randint(0, 4294967295)
120 | 
121 |         if 'voice' in request.files and request.files['voice'].filename != '':
122 |             voice_file = request.files['voice']
123 |             voice_name = os.path.splitext(voice_file.filename)[0]
124 | 
125 |             # Read file into memory
126 |             voice_data = io.BytesIO()
127 |             voice_file.save(voice_data)
128 |             voice_data.seek(0)
129 | 
130 |             # Convert style to wav using FFmpeg, reading from and writing to memory
131 |             process = subprocess.Popen(
132 |                 ['ffmpeg', '-i', '-', '-ac', '1', '-ar', '24000', '-f', 'wav', '-'],
133 |                 stdin=subprocess.PIPE,
134 |                 stdout=subprocess.PIPE,
135 |                 stderr=subprocess.PIPE)
136 |             stdout, stderr = process.communicate(input=voice_data.read())
137 | 
138 |             # Check for errors
139 |             if process.returncode != 0:
140 |                 return jsonify({"error": "FFmpeg error", "cause": stderr.decode(errors="replace")}), 500
141 | 
142 |             # Convert stdout bytes to a BytesIO object
143 |             style_wav_data = io.BytesIO(stdout)
144 | 
145 |             # Compute style and store in voices dictionary
146 |             voices[session_id] = compute_voice(style_wav_data)
147 |         else:
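148 |             # 'voice' is optional and may arrive as form-data or in a JSON body
149 |             body = request.get_json(silent=True) or {}
150 |             voice_name = request.form.get('voice') or body.get('voice') or 'en-f-1'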
151 |             voice_file_path = os.path.join(fdir, 'voices', f'{voice_name}.wav')
152 |             voice = compute_voice(voice_file_path)
153 |             voices[session_id] = voice
154 |             voices[voice_name] = voice
155 | 
156 |         return jsonify({
157 |             "message": "New session created",
158 |             "voice": voice_name,
159 |             "session_id": session_id
160 |         })
161 | 
162 |     except Exception as e:
163 |         traceback.print_exc()
164 |         return jsonify({"error": "Failed to create new session"}), 500
165 | 
166 | 
167 | @app.route('/tts', methods=['POST'])
168 | def text_to_speech():
169 |     try:
170 |         # Accept header validation
171 |         if request.headers.get('Accept') != 'audio/wav':
172 |             raise ServerException("Only audio/wav is supported", 406)
173 | 
174 |         data = request.json
175 |         session_id = int(data.get('sessionId', -1))  # -1 if not found
176 | 
177 |         # Session ID validation
178 |         if session_id < 0 or session_id > 4294967295:
179 |             raise ServerException(f"Invalid session ID: {session_id}", 400)
180 | 
181 |         # Compute or retrieve style
182 |         if session_id not in voices:
183 |             voice_name = data.get('voice', 'en-f-1')
184 |             if voice_name in voices:
185 |                 voice = voices[voice_name]
186 |             else:
187 |                 voice_file_path = os.path.join(
188 |                     fdir, 'voices', f'{voice_name}.wav')
189 |                 voice = compute_voice(voice_file_path)
190 |                 voices[session_id] = voice
191 |                 voices[voice_name] = voice
192 |         else:
193 |             voice = voices[session_id]
194 | 
195 |         text = data.get('text')
196 |         if not text:
197 |             raise ServerException("Text is required", 400)
198 | 
199 |         seed = data.get('seed', session_id)
200 |         if (seed == -1):
201 |             seed = random.randint(0, 4294967295)
202 |         tts.set_seed(int(seed))
203 | 
204 |         # Generate audio
205 |         wav, prev_s = tts.inference(
206 |             text,
207 |             voice,
208 |             phonemizer=phonemizer,
209 |             prev_s=prev_s_db.get(session_id, None),
210 |             alpha=data.get('alpha', 0.3),
211 |             beta=data.get('beta', 0.7),
212 |             diffusion_steps=data.get('diffusion_steps', 20),
213 |             embedding_scale=data.get('embedding_scale', 1.5)
214 |         )
215 | 
216 |         # Update previous voice state
217 |         prev_s_db[session_id] = prev_s
218 | 
219 |         byte_io = io.BytesIO()
220 |         sf.write(byte_io, wav, 24000, format='WAV')
221 |         byte_io.seek(0)
222 |         return send_file(byte_io, mimetype='audio/wav')
223 | 
224 |     except ServerException as e:
225 |         raise e
226 | 
227 |     except Exception as e:
228 |         raise ServerException(str(e), 500)
229 | 
230 | 
231 | def main():
232 |     global tts
233 |     parser = argparse.ArgumentParser(
234 |         description='Run the Text-to-Speech Server')
235 | 
236 |     parser.add_argument('--debug', action='store_true',
237 |                         help='Enable debug mode')
238 |     parser.add_argument('--host', default='127.0.0.1',
239 |                         help='Host IP address (default: 127.0.0.1)')
240 |     parser.add_argument('--port', type=int, default=5050,
241 |                         help='Port number (default: 5050)')
242 | 
243 |     args = parser.parse_args()
244 | 
245 |     tts = TTS.load_model(
246 |         config_path="hf://yl4579/StyleTTS2-LibriTTS/Models/LibriTTS/config.yml",
247 |         checkpoint_path="hf://yl4579/StyleTTS2-LibriTTS/Models/LibriTTS/epochs_2nd_00020.pth"
248 |     )
249 | 
250 |     app.run(debug=args.debug, host=args.host, port=args.port)
251 | 
252 | 
253 | if __name__ == '__main__':
254 |     main()
255 | 
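
The server keeps computed voice styles in the module-level `voices` dictionary so that each reference file is processed only once. The sketch below isolates that caching idea in a small class. It is an illustration only: `VoiceCache` and `VOICE_NAME_RE` are hypothetical names, the style computation is injected as a plain callable standing in for `tts.compute_style`, and the name check (letters, digits, `_`, and `-` only) is an added assumption meant to keep a user-supplied voice name from pointing outside the voices directory.

```python
import os
import re

# Assumption: built-in voice names look like "en-f-1".
VOICE_NAME_RE = re.compile(r'[A-Za-z0-9_-]+')


class VoiceCache:
    """Compute a voice style once per voice name and reuse it afterwards."""

    def __init__(self, voices_dir, compute_style):
        self.voices_dir = voices_dir
        self.compute_style = compute_style  # callable: file path -> style
        self._styles = {}

    def get(self, voice_name):
        # Reject names containing path separators, dots, or other characters.
        if not VOICE_NAME_RE.fullmatch(voice_name):
            raise ValueError(f'Invalid voice name: {voice_name!r}')
        if voice_name not in self._styles:
            path = os.path.join(self.voices_dir, f'{voice_name}.wav')
            self._styles[voice_name] = self.compute_style(path)
        return self._styles[voice_name]
```

In the server this might be constructed as `VoiceCache(os.path.join(fdir, 'voices'), tts.compute_style)` once the model is loaded, with `cache.get(voice_name)` replacing the repeated lookup-or-compute branches.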


--------------------------------------------------------------------------------
/tts_server/voices/en-f-1.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxe/tts-server/a46e4fbab0fd02e1f427387fa7e02f6413ded8d4/tts_server/voices/en-f-1.wav


--------------------------------------------------------------------------------
/tts_server/voices/en-f-2.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxe/tts-server/a46e4fbab0fd02e1f427387fa7e02f6413ded8d4/tts_server/voices/en-f-2.wav


--------------------------------------------------------------------------------
/tts_server/voices/en-f-3.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxe/tts-server/a46e4fbab0fd02e1f427387fa7e02f6413ded8d4/tts_server/voices/en-f-3.wav


--------------------------------------------------------------------------------
/tts_server/voices/en-m-1.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxe/tts-server/a46e4fbab0fd02e1f427387fa7e02f6413ded8d4/tts_server/voices/en-m-1.wav


--------------------------------------------------------------------------------
/tts_server/voices/en-m-2.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxe/tts-server/a46e4fbab0fd02e1f427387fa7e02f6413ded8d4/tts_server/voices/en-m-2.wav


--------------------------------------------------------------------------------
/tts_server/voices/en-m-3.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxe/tts-server/a46e4fbab0fd02e1f427387fa7e02f6413ded8d4/tts_server/voices/en-m-3.wav


--------------------------------------------------------------------------------