├── .gitignore
├── LICENSE
├── README.md
├── nix
│   ├── __init__.py
│   ├── models
│   │   └── TTS.py
│   └── tokenizers
│       └── tokenizer_en.py
└── requirements.txt

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Python cache
__pycache__/
*.pyc
.ipynb_checkpoints
assets/
notebooks/
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2022 Rendi Chevi

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# **🐤 Nix-TTS**

### **Lightweight and End-to-end Text-to-Speech via Module-wise Distillation**

#### Rendi Chevi, Radityo Eko Prasojo, Alham Fikri Aji, Andros Tjandra, Sakriani Sakti

This is the repository for our paper, **🐤 Nix-TTS** (accepted to IEEE SLT 2022). We release the pretrained models, an interactive demo, and audio samples below.

[[📄 Paper Link](Coming Soon!)] [[🤗 Interactive Demo](https://huggingface.co/spaces/rendchevi/nix-tts)] [[📢 Audio Samples](https://anon1178.github.io/Nix-SLT-Demo/)]

**Abstract**    Several solutions for lightweight TTS have shown promising results. Still, they either rely on hand-crafted designs that reach a non-optimal size, or use neural architecture search that often incurs high training costs. We present Nix-TTS, a lightweight TTS obtained via knowledge distillation from a high-quality yet large-sized, non-autoregressive, and end-to-end (vocoder-free) TTS teacher model. Specifically, we propose module-wise distillation, which enables flexible and independent distillation of the encoder and decoder modules. The resulting Nix-TTS inherits the advantageous non-autoregressive and end-to-end properties of the teacher, yet is significantly smaller, with only 5.23M parameters or up to an 89.34% reduction of the teacher model's size; it also achieves 3.04× and 8.36× inference speedups on an Intel i7 CPU and a Raspberry Pi 3B respectively, while retaining fair voice naturalness and intelligibility compared to the teacher model.

## **Getting Started with Nix-TTS**
**Clone the `nix-tts` repository and move to its directory**
```bash
git clone https://github.com/rendchevi/nix-tts.git
cd nix-tts
```

**Install the dependencies**
- Install the Python dependencies. We recommend `python >= 3.8`.
```bash
pip install -r requirements.txt
```
- Install espeak on your device (needed for text tokenization).
```bash
sudo apt-get install espeak
```
Or follow the [official instructions](https://github.com/bootphon/phonemizer#dependencies) in case the command above doesn't work.

**Download your chosen pre-trained model [here](https://drive.google.com/drive/folders/1GbFOnJsgKHCAXySm2sTluRRikc4TAWxJ?usp=sharing).**

| Model | Num. of Params | Faster than real-time* (CPU Intel i7) | Faster than real-time* (RasPi Model 3B) |
| ---------- | -------------- | ---- | ---- |
| Nix-TTS (ONNX) | 5.23 M | 11.9x | 0.50x |
| Nix-TTS w/ Stochastic Duration (ONNX) | 6.03 M | 10.8x | 0.50x |

\* We compute how much faster than real-time a model runs as the inverse of its Real-Time Factor (RTF). The complete speedup table for all models is detailed in the paper.

**And running Nix-TTS is as easy as:**
```py
from nix.models.TTS import NixTTSInference
from IPython.display import Audio

# Initiate Nix-TTS
nix = NixTTSInference(model_dir = "")
# Tokenize input text
c, c_length, phoneme = nix.tokenize("Born to multiply, born to gaze into night skies.")
# Convert text to raw speech
xw = nix.vocalize(c, c_length)

# Listen to the generated speech
Audio(xw[0,0], rate = 22050)
```
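
**(Optional) Saving the generated speech to a file.** Outside a notebook, you may prefer to write the waveform to disk instead of playing it with `IPython.display.Audio`. Below is a minimal sketch of one way to do so; it assumes `scipy` is installed (it is not listed in `requirements.txt`), reuses the `xw` array and the 22050 Hz sample rate from the example above, and uses an example output filename.
```py
import numpy as np
from scipy.io import wavfile

# `xw` comes from nix.vocalize() above and is shaped [batch, 1, num_samples].
# Convert the float waveform to 16-bit PCM and write it as a WAV file.
audio = np.clip(xw[0, 0], -1.0, 1.0)
wavfile.write("nix_tts_sample.wav", rate = 22050, data = (audio * 32767).astype(np.int16))
```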

## **Acknowledgement**
- This research is fully and exclusively funded by [Kata.ai](https://kata.ai), where the authors work as part of the [Kata.ai Research Team](https://kata.ai/research).
- Some of the complex parts of our model, as mentioned in the paper, are adapted from the original implementations of [VITS](https://github.com/jaywalnut310/vits) and [Comprehensive-Transformer-TTS](https://github.com/keonlee9420/Comprehensive-Transformer-TTS).
--------------------------------------------------------------------------------
/nix/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rendchevi/nix-tts/67aebacc4dcf5d12692ea5088fcb491d928f6044/nix/__init__.py
--------------------------------------------------------------------------------
/nix/models/TTS.py:
--------------------------------------------------------------------------------
import os
import pickle

import numpy as np
import onnxruntime as ort

from nix.tokenizers.tokenizer_en import NixTokenizerEN


class NixTTSInference:

    def __init__(
        self,
        model_dir,
    ):
        # Load tokenizer
        self.tokenizer = NixTokenizerEN(pickle.load(open(os.path.join(model_dir, "tokenizer_state.pkl"), "rb")))
        # Load TTS model (two ONNX graphs: text encoder and audio decoder)
        self.encoder = ort.InferenceSession(os.path.join(model_dir, "encoder.onnx"))
        self.decoder = ort.InferenceSession(os.path.join(model_dir, "decoder.onnx"))

    def tokenize(
        self,
        text,
    ):
        # Tokenize input text
        c, c_lengths, phonemes = self.tokenizer([text])

        return np.array(c, dtype = np.int64), np.array(c_lengths, dtype = np.int64), phonemes

    def vocalize(
        self,
        c,
        c_lengths,
    ):
        """
        Single-batch TTS inference
        """
        # Infer latent samples from the encoder
        z = self.encoder.run(
            None,
            {
                "c": c,
                "c_lengths": c_lengths,
            }
        )[2]
        # Decode raw audio with the decoder
        xw = self.decoder.run(
            None,
            {
                "z": z,
            }
        )[0]

        return xw
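
# ------------------------------------------------------------------------------
# Illustrative usage sketch (not part of the original file): one minimal way to
# time inference and estimate the real-time factor (RTF) mentioned in the README.
# The model path below is a hypothetical placeholder; it should point to a
# downloaded pre-trained model directory containing tokenizer_state.pkl,
# encoder.onnx, and decoder.onnx.
if __name__ == "__main__":
    import time

    model_dir = "path/to/pretrained/model"  # hypothetical path, replace with yours
    nix = NixTTSInference(model_dir)

    c, c_lengths, _ = nix.tokenize("Born to multiply, born to gaze into night skies.")

    start = time.perf_counter()
    xw = nix.vocalize(c, c_lengths)
    elapsed = time.perf_counter() - start

    # RTF = synthesis time / audio duration; values below 1.0 are faster than real-time.
    audio_seconds = xw.shape[-1] / 22050
    print(f"Generated {audio_seconds:.2f}s of audio in {elapsed:.2f}s (RTF = {elapsed / audio_seconds:.3f})")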
--------------------------------------------------------------------------------
/nix/tokenizers/tokenizer_en.py:
--------------------------------------------------------------------------------
# Regex
import re

# Phonemizer
from phonemizer.backend import EspeakBackend
phonemizer_backend = EspeakBackend(
    language = 'en-us',
    preserve_punctuation = True,
    with_stress = True
)


class NixTokenizerEN:

    def __init__(
        self,
        tokenizer_state,
    ):
        # Vocab and abbreviations dictionary
        self.vocab_dict = tokenizer_state["vocab_dict"]
        self.abbreviations_dict = tokenizer_state["abbreviations_dict"]

        # Regex recipes
        self.whitespace_regex = tokenizer_state["whitespace_regex"]
        self.abbreviations_regex = tokenizer_state["abbreviations_regex"]

    def __call__(
        self,
        texts,
    ):
        # 1. Phonemize input texts
        phonemes = [ self._collapse_whitespace(
                         phonemizer_backend.phonemize(
                             self._expand_abbreviations(text.lower()),
                             strip = True,
                         )
                     ) for text in texts ]

        # 2. Tokenize phonemes, interspersing a blank token (0) between symbols
        tokens = [ self._intersperse([self.vocab_dict[p] for p in phoneme], 0) for phoneme in phonemes ]

        # 3. Pad tokens to the longest sequence in the batch
        tokens, tokens_lengths = self._pad_tokens(tokens)

        return tokens, tokens_lengths, phonemes

    def _expand_abbreviations(
        self,
        text
    ):
        for regex, replacement in self.abbreviations_regex:
            text = re.sub(regex, replacement, text)

        return text

    def _collapse_whitespace(
        self,
        text
    ):
        return re.sub(self.whitespace_regex, ' ', text)

    def _intersperse(
        self,
        lst,
        item,
    ):
        # Insert `item` between (and around) every element of `lst`
        result = [item] * (len(lst) * 2 + 1)
        result[1::2] = lst
        return result

    def _pad_tokens(
        self,
        tokens,
    ):
        # Right-pad every token sequence with zeros up to the batch maximum length
        tokens_lengths = [len(token) for token in tokens]
        max_len = max(tokens_lengths)
        tokens = [token + [0 for _ in range(max_len - len(token))] for token in tokens]
        return tokens, tokens_lengths
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
numpy
onnxruntime==1.7.0
phonemizer==2.2.1
--------------------------------------------------------------------------------