├── .gitignore
├── LICENSE
├── README.md
├── nix
│   ├── __init__.py
│   ├── models
│   │   └── TTS.py
│   └── tokenizers
│       └── tokenizer_en.py
└── requirements.txt
/.gitignore:
--------------------------------------------------------------------------------
1 | # Python cache
2 | __pycache__/
3 | *.pyc
4 | .ipynb_checkpoints
5 | assets/
6 | notebooks/
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2022 Rendi Chevi
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # **🐤 Nix-TTS**
2 |
3 | ### **Lightweight and End-to-end Text-to-Speech via Module-wise Distillation**
4 |
5 | #### Rendi Chevi, Radityo Eko Prasojo, Alham Fikri Aji, Andros Tjandra, Sakriani Sakti
6 |
7 | This is a repository for our paper, **🐤 Nix-TTS** (Accepted to IEEE SLT 2022). We released the pretrained models, an interactive demo, and audio samples below.
8 |
9 | [[📄 Paper Link](Coming Soon!)] [[🤗 Interactive Demo](https://huggingface.co/spaces/rendchevi/nix-tts)] [[📢 Audio Samples](https://anon1178.github.io/Nix-SLT-Demo/)]
10 |
11 | **Abstract** Several solutions for lightweight TTS have shown promising results. Still, they either rely on a hand-crafted design that reaches a non-optimal size or use a neural architecture search but often suffer from high training costs. We present Nix-TTS, a lightweight TTS achieved via knowledge distillation to a high-quality yet large-sized, non-autoregressive, and end-to-end (vocoder-free) TTS teacher model. Specifically, we offer module-wise distillation, enabling flexible and independent distillation to the encoder and decoder module. The resulting Nix-TTS inherits the advantageous properties of being non-autoregressive and end-to-end from the teacher, yet is significantly smaller in size, with only 5.23M parameters or up to an 89.34% reduction of the teacher model; it also achieves over 3.04× and 8.36× inference speedup on an Intel-i7 CPU and a Raspberry Pi 3B respectively, and still retains fair voice naturalness and intelligibility compared to the teacher model.
12 |
13 | ## **Getting Started with Nix-TTS**
14 | **Clone the `nix-tts` repository and move to its directory**
15 | ```bash
16 | git clone https://github.com/rendchevi/nix-tts.git
17 | cd nix-tts
18 | ```
19 |
20 | **Install the dependencies**
21 | - Install Python dependencies. We recommend `python >= 3.8`
22 | ```bash
23 | pip install -r requirements.txt
24 | ```
25 | - Install espeak on your device (required for text tokenization).
26 | ```bash
27 | sudo apt-get install espeak
28 | ```
29 | Or follow the [official instructions](https://github.com/bootphon/phonemizer#dependencies) if the command above doesn't work on your system.
30 |
31 | **Download your chosen pre-trained model [here](https://drive.google.com/drive/folders/1GbFOnJsgKHCAXySm2sTluRRikc4TAWxJ?usp=sharing)**.
32 |
33 | | Model | Num. of Params | Faster than real-time* (CPU Intel-i7) | Faster than real-time* (RasPi Model 3B) |
34 | | ---------- | -------------- | ----| ----|
35 | | Nix-TTS (ONNX) | 5.23 M | 11.9x | 0.50x |
36 | | Nix-TTS w/ Stochastic Duration (ONNX) | 6.03 M | 10.8x | 0.50x |
37 |
38 | **\*** Here we report how much faster than real-time the model runs, computed as the inverse of the Real-Time Factor (RTF). The complete speedup table for all models is detailed in the paper.
39 |
40 | **And running Nix-TTS is as easy as:**
41 | ```py
42 | from nix.models.TTS import NixTTSInference
43 | from IPython.display import Audio
44 |
45 | # Initialize Nix-TTS with the directory of the downloaded pre-trained model
46 | nix = NixTTSInference(model_dir = "")
47 | # Tokenize input text
48 | c, c_length, phoneme = nix.tokenize("Born to multiply, born to gaze into night skies.")
49 | # Convert text to raw speech
50 | xw = nix.vocalize(c, c_length)
51 |
52 | # Listen to the generated speech
53 | Audio(xw[0,0], rate = 22050)
54 | ```
55 |
56 | ## **Acknowledgement**
57 | - This research is fully and exclusively funded by [Kata.ai](https://kata.ai), where the authors work as part of the [Kata.ai Research Team](https://kata.ai/research).
58 | - Some of the complex parts of our model, as mentioned in the paper, are adapted from the original implementation of [VITS](https://github.com/jaywalnut310/vits) and [Comprehensive-Transformer-TTS](https://github.com/keonlee9420/Comprehensive-Transformer-TTS).
59 |
--------------------------------------------------------------------------------
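As a follow-up to the README's usage example, here is a minimal sketch of writing the generated speech to a WAV file instead of playing it in a notebook. It assumes the `soundfile` package (not listed in `requirements.txt`), uses a placeholder model path, and keeps the 22,050 Hz sample rate from the README example.

```py
import soundfile as sf  # assumed extra dependency, not in requirements.txt

from nix.models.TTS import NixTTSInference

# Placeholder path to the downloaded pre-trained model directory
nix = NixTTSInference(model_dir = "path/to/nix-tts-model")

# Tokenize and synthesize, then write the single batch item to disk
c, c_lengths, phonemes = nix.tokenize("Born to multiply, born to gaze into night skies.")
xw = nix.vocalize(c, c_lengths)
sf.write("nix_sample.wav", xw[0, 0], samplerate = 22050)
```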
/nix/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rendchevi/nix-tts/67aebacc4dcf5d12692ea5088fcb491d928f6044/nix/__init__.py
--------------------------------------------------------------------------------
/nix/models/TTS.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import timeit
4 |
5 | import numpy as np
6 | import onnxruntime as ort
7 |
8 | from nix.tokenizers.tokenizer_en import NixTokenizerEN
9 |
10 | class NixTTSInference:
11 |
12 |     def __init__(
13 |         self,
14 |         model_dir,
15 |     ):
16 |         # Load tokenizer
17 |         self.tokenizer = NixTokenizerEN(pickle.load(open(os.path.join(model_dir, "tokenizer_state.pkl"), "rb")))
18 |         # Load TTS model
19 |         self.encoder = ort.InferenceSession(os.path.join(model_dir, "encoder.onnx"))
20 |         self.decoder = ort.InferenceSession(os.path.join(model_dir, "decoder.onnx"))
21 |
22 |     def tokenize(
23 |         self,
24 |         text,
25 |     ):
26 |         # Tokenize input text
27 |         c, c_lengths, phonemes = self.tokenizer([text])
28 |
29 |         return np.array(c, dtype = np.int64), np.array(c_lengths, dtype = np.int64), phonemes
30 |
31 |     def vocalize(
32 |         self,
33 |         c,
34 |         c_lengths,
35 |     ):
36 |         """
37 |         Single-batch TTS inference
38 |         """
39 |         # Infer latent samples from encoder
40 |         z = self.encoder.run(
41 |             None,
42 |             {
43 |                 "c": c,
44 |                 "c_lengths": c_lengths,
45 |             }
46 |         )[2]
47 |         # Decode raw audio with decoder
48 |         xw = self.decoder.run(
49 |             None,
50 |             {
51 |                 "z": z,
52 |             }
53 |         )[0]
54 |
55 |         return xw
56 |
--------------------------------------------------------------------------------
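The README's "faster than real-time" column is described as the inverse of the Real-Time Factor (RTF). A rough sketch of how such a figure could be measured with `NixTTSInference` on your own machine (the model path and input sentence are placeholders; 22,050 Hz is the sample rate used in the README example):

```py
import time

from nix.models.TTS import NixTTSInference

nix = NixTTSInference(model_dir = "path/to/nix-tts-model")  # placeholder path
c, c_lengths, _ = nix.tokenize("Born to multiply, born to gaze into night skies.")

start = time.perf_counter()
xw = nix.vocalize(c, c_lengths)
elapsed = time.perf_counter() - start

audio_seconds = xw.shape[-1] / 22050  # duration of the generated audio
rtf = elapsed / audio_seconds         # Real-Time Factor
print(f"RTF = {rtf:.3f} -> {1 / rtf:.1f}x faster than real-time")
```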
/nix/tokenizers/tokenizer_en.py:
--------------------------------------------------------------------------------
1 | # Regex
2 | import re
3 |
4 | # Phonemizer
5 | from phonemizer.backend import EspeakBackend
6 | phonemizer_backend = EspeakBackend(
7 |     language = 'en-us',
8 |     preserve_punctuation = True,
9 |     with_stress = True
10 | )
11 |
12 | class NixTokenizerEN:
13 |
14 |     def __init__(
15 |         self,
16 |         tokenizer_state,
17 |     ):
18 |         # Vocab and abbreviations dictionary
19 |         self.vocab_dict = tokenizer_state["vocab_dict"]
20 |         self.abbreviations_dict = tokenizer_state["abbreviations_dict"]
21 |
22 |         # Regex recipe
23 |         self.whitespace_regex = tokenizer_state["whitespace_regex"]
24 |         self.abbreviations_regex = tokenizer_state["abbreviations_regex"]
25 |
26 |     def __call__(
27 |         self,
28 |         texts,
29 |     ):
30 |         # 1. Phonemize input texts
31 |         phonemes = [ self._collapse_whitespace(
32 |             phonemizer_backend.phonemize(
33 |                 self._expand_abbreviations(text.lower()),
34 |                 strip = True,
35 |             )
36 |         ) for text in texts ]
37 |
38 |         # 2. Tokenize phonemes
39 |         tokens = [ self._intersperse([self.vocab_dict[p] for p in phoneme], 0) for phoneme in phonemes ]
40 |
41 |         # 3. Pad tokens
42 |         tokens, tokens_lengths = self._pad_tokens(tokens)
43 |
44 |         return tokens, tokens_lengths, phonemes
45 |
46 |     def _expand_abbreviations(
47 |         self,
48 |         text
49 |     ):
50 |         for regex, replacement in self.abbreviations_regex:
51 |             text = re.sub(regex, replacement, text)
52 |
53 |         return text
54 |
55 |     def _collapse_whitespace(
56 |         self,
57 |         text
58 |     ):
59 |         return re.sub(self.whitespace_regex, ' ', text)
60 |
61 |     def _intersperse(
62 |         self,
63 |         lst,
64 |         item,
65 |     ):
66 |         result = [item] * (len(lst) * 2 + 1)
67 |         result[1::2] = lst
68 |         return result
69 |
70 |     def _pad_tokens(
71 |         self,
72 |         tokens,
73 |     ):
74 |         tokens_lengths = [len(token) for token in tokens]
75 |         max_len = max(tokens_lengths)
76 |         tokens = [token + [0 for _ in range(max_len - len(token))] for token in tokens]
77 |         return tokens, tokens_lengths
--------------------------------------------------------------------------------
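To make the tokenizer's post-processing concrete, here is a small standalone sketch of what the `_intersperse` and `_pad_tokens` helpers above produce, using toy token IDs rather than a real vocabulary: the blank/pad token 0 is interleaved around every phoneme token, and each sequence is right-padded to the longest one in the batch.

```py
def intersperse(lst, item):
    # Place `item` before, between, and after every element of `lst`
    result = [item] * (len(lst) * 2 + 1)
    result[1::2] = lst
    return result

def pad_tokens(tokens):
    # Right-pad each sequence with 0 up to the longest sequence in the batch
    tokens_lengths = [len(token) for token in tokens]
    max_len = max(tokens_lengths)
    tokens = [token + [0] * (max_len - len(token)) for token in tokens]
    return tokens, tokens_lengths

print(intersperse([7, 8, 9], 0))        # [0, 7, 0, 8, 0, 9, 0]
print(pad_tokens([[7, 8], [7, 8, 9]]))  # ([[7, 8, 0], [7, 8, 9]], [2, 3])
```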
/requirements.txt:
--------------------------------------------------------------------------------
1 | numpy
2 | onnxruntime==1.7.0
3 | phonemizer==2.2.1
4 |
--------------------------------------------------------------------------------