├── .gitignore
├── LICENSE
├── README.md
├── nix
│   ├── __init__.py
│   ├── models
│   │   └── TTS.py
│   └── tokenizers
│       └── tokenizer_en.py
└── requirements.txt
/.gitignore:
--------------------------------------------------------------------------------
1 | # Python cache
2 | __pycache__/
3 | *.pyc
4 | .ipynb_checkpoints
5 | assets/
6 | notebooks/
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2022 Rendi Chevi
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # **🐤 Nix-TTS**
2 |
3 | ### **Lightweight and End-to-end Text-to-Speech via Module-wise Distillation**
4 |
5 | #### Rendi Chevi, Radityo Eko Prasojo, Alham Fikri Aji, Andros Tjandra, Sakriani Sakti
6 |
7 | This is a repository for our paper, **🐤 Nix-TTS** (Accepted to IEEE SLT 2022). We released the pretrained models, an interactive demo, and audio samples below.
8 |
9 | [[📄 Paper Link](Coming Soon!)] [[🤗 Interactive Demo](https://huggingface.co/spaces/rendchevi/nix-tts)] [[📢 Audio Samples](https://anon1178.github.io/Nix-SLT-Demo/)]
10 |
11 | **Abstract** Several solutions for lightweight TTS have shown promising results. Still, they either rely on a hand-crafted design that reaches a non-optimal size or use a neural architecture search but often suffer from high training costs. We present Nix-TTS, a lightweight TTS achieved via knowledge distillation to a high-quality yet large-sized, non-autoregressive, and end-to-end (vocoder-free) TTS teacher model. Specifically, we offer module-wise distillation, enabling flexible and independent distillation to the encoder and decoder module. The resulting Nix-TTS inherits the advantageous properties of being non-autoregressive and end-to-end from the teacher, yet is significantly smaller in size, with only 5.23M parameters or up to an 89.34% reduction of the teacher model; it also achieves over 3.04× and 8.36× inference speedup on an Intel-i7 CPU and a Raspberry Pi 3B respectively, and still retains fair voice naturalness and intelligibility compared to the teacher model.
12 |
13 | ## **Getting Started with Nix-TTS**
14 | **Clone the `nix-tts` repository and move to its directory**
15 | ```bash
16 | git clone https://github.com/rendchevi/nix-tts.git
17 | cd nix-tts
18 | ```
19 |
20 | **Install the dependencies**
21 | - Install Python dependencies. We recommend `python >= 3.8`
22 | ```bash
23 | pip install -r requirements.txt
24 | ```
25 | - Install espeak on your device (required for text tokenization).
26 | ```bash
27 | sudo apt-get install espeak
28 | ```
29 | Or follow the [official instructions](https://github.com/bootphon/phonemizer#dependencies) if the command above doesn't work on your system.
30 |
31 | **Download your chosen pre-trained model [here](https://drive.google.com/drive/folders/1GbFOnJsgKHCAXySm2sTluRRikc4TAWxJ?usp=sharing)**.
32 |
33 | | Model | Num. of Params | Faster than real-time* (CPU Intel-i7) | Faster than real-time* (RasPi Model 3B) |
34 | | ---------- | -------------- | ----| ----|
35 | | Nix-TTS (ONNX) | 5.23 M | 11.9x | 0.50x |
36 | | Nix-TTS w/ Stochastic Duration (ONNX) | 6.03 M | 10.8x | 0.50x |
37 |
38 | **\*** Here we report how much faster than real-time the model runs, computed as the inverse of the Real-Time Factor (RTF). The complete speedup table for all models is detailed in the paper.
39 |
40 | **And running Nix-TTS is as easy as:**
41 | ```py
42 | from nix.models.TTS import NixTTSInference
43 | from IPython.display import Audio
44 |
45 | # Initialize Nix-TTS with the directory of the downloaded pre-trained model
46 | nix = NixTTSInference(model_dir = "")
47 | # Tokenize input text
48 | c, c_length, phoneme = nix.tokenize("Born to multiply, born to gaze into night skies.")
49 | # Convert text to raw speech
50 | xw = nix.vocalize(c, c_length)
51 |
52 | # Listen to the generated speech
53 | Audio(xw[0,0], rate = 22050)
54 | ```
55 |
56 | ## **Acknowledgement**
57 | - This research is fully and exclusively funded by [Kata.ai](https://kata.ai), where the authors work as part of the [Kata.ai Research Team](https://kata.ai/research).
58 | - Some of the complex parts of our model, as mentioned in the paper, are adapted from the original implementation of [VITS](https://github.com/jaywalnut310/vits) and [Comprehensive-Transformer-TTS](https://github.com/keonlee9420/Comprehensive-Transformer-TTS).
59 |
--------------------------------------------------------------------------------
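As a follow-up to the README's usage example, here is a minimal sketch of writing the generated speech to a WAV file instead of playing it in a notebook. It assumes the `soundfile` package (not listed in `requirements.txt`), uses a placeholder model path, and keeps the 22,050 Hz sample rate from the README example.

```py
import soundfile as sf  # assumed extra dependency, not in requirements.txt

from nix.models.TTS import NixTTSInference

# Placeholder path to the downloaded pre-trained model directory
nix = NixTTSInference(model_dir = "path/to/nix-tts-model")

# Tokenize and synthesize, then write the single batch item to disk
c, c_lengths, phonemes = nix.tokenize("Born to multiply, born to gaze into night skies.")
xw = nix.vocalize(c, c_lengths)
sf.write("nix_sample.wav", xw[0, 0], samplerate = 22050)
```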
/nix/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rendchevi/nix-tts/67aebacc4dcf5d12692ea5088fcb491d928f6044/nix/__init__.py
--------------------------------------------------------------------------------
/nix/models/TTS.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import timeit
4 |
5 | import numpy as np
6 | import onnxruntime as ort
7 |
8 | from nix.tokenizers.tokenizer_en import NixTokenizerEN
9 |
10 | class NixTTSInference:
11 |
12 |     def __init__(
13 |         self,
14 |         model_dir,
15 |     ):
16 |         # Load tokenizer
17 |         self.tokenizer = NixTokenizerEN(pickle.load(open(os.path.join(model_dir, "tokenizer_state.pkl"), "rb")))
18 |         # Load TTS model
19 |         self.encoder = ort.InferenceSession(os.path.join(model_dir, "encoder.onnx"))
20 |         self.decoder = ort.InferenceSession(os.path.join(model_dir, "decoder.onnx"))
21 |
22 |     def tokenize(
23 |         self,
24 |         text,
25 |     ):
26 |         # Tokenize input text
27 |         c, c_lengths, phonemes = self.tokenizer([text])
28 |
29 |         return np.array(c, dtype = np.int64), np.array(c_lengths, dtype = np.int64), phonemes
30 |
31 |     def vocalize(
32 |         self,
33 |         c,
34 |         c_lengths,
35 |     ):
36 |         """
37 |         Single-batch TTS inference
38 |         """
39 |         # Infer latent samples from encoder
40 |         z = self.encoder.run(
41 |             None,
42 |             {
43 |                 "c": c,
44 |                 "c_lengths": c_lengths,
45 |             }
46 |         )[2]
47 |         # Decode raw audio with decoder
48 |         xw = self.decoder.run(
49 |             None,
50 |             {
51 |                 "z": z,
52 |             }
53 |         )[0]
54 |
55 |         return xw
56 |
--------------------------------------------------------------------------------
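The README's "faster than real-time" column is described as the inverse of the Real-Time Factor (RTF). A rough sketch of how such a figure could be measured with `NixTTSInference` on your own machine (the model path and input sentence are placeholders; 22,050 Hz is the sample rate used in the README example):

```py
import time

from nix.models.TTS import NixTTSInference

nix = NixTTSInference(model_dir = "path/to/nix-tts-model")  # placeholder path
c, c_lengths, _ = nix.tokenize("Born to multiply, born to gaze into night skies.")

start = time.perf_counter()
xw = nix.vocalize(c, c_lengths)
elapsed = time.perf_counter() - start

audio_seconds = xw.shape[-1] / 22050  # duration of the generated audio
rtf = elapsed / audio_seconds         # Real-Time Factor
print(f"RTF = {rtf:.3f} -> {1 / rtf:.1f}x faster than real-time")
```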
/nix/tokenizers/tokenizer_en.py:
--------------------------------------------------------------------------------
1 | # Regex
2 | import re
3 |
4 | # Phonemizer
5 | from phonemizer.backend import EspeakBackend
6 | phonemizer_backend = EspeakBackend(
7 |     language = 'en-us',
8 |     preserve_punctuation = True,
9 |     with_stress = True
10 | )
11 |
12 | class NixTokenizerEN:
13 |
14 |     def __init__(
15 |         self,
16 |         tokenizer_state,
17 |     ):
18 |         # Vocab and abbreviations dictionary
19 |         self.vocab_dict = tokenizer_state["vocab_dict"]
20 |         self.abbreviations_dict = tokenizer_state["abbreviations_dict"]
21 |
22 |         # Regex recipe
23 |         self.whitespace_regex = tokenizer_state["whitespace_regex"]
24 |         self.abbreviations_regex = tokenizer_state["abbreviations_regex"]
25 |
26 |     def __call__(
27 |         self,
28 |         texts,
29 |     ):
30 |         # 1. Phonemize input texts
31 |         phonemes = [ self._collapse_whitespace(
32 |             phonemizer_backend.phonemize(
33 |                 self._expand_abbreviations(text.lower()),
34 |                 strip = True,
35 |             )
36 |         ) for text in texts ]
37 |
38 |         # 2. Tokenize phonemes
39 |         tokens = [ self._intersperse([self.vocab_dict[p] for p in phoneme], 0) for phoneme in phonemes ]
40 |
41 |         # 3. Pad tokens
42 |         tokens, tokens_lengths = self._pad_tokens(tokens)
43 |
44 |         return tokens, tokens_lengths, phonemes
45 |
46 |     def _expand_abbreviations(
47 |         self,
48 |         text
49 |     ):
50 |         for regex, replacement in self.abbreviations_regex:
51 |             text = re.sub(regex, replacement, text)
52 |
53 |         return text
54 |
55 |     def _collapse_whitespace(
56 |         self,
57 |         text
58 |     ):
59 |         return re.sub(self.whitespace_regex, ' ', text)
60 |
61 |     def _intersperse(
62 |         self,
63 |         lst,
64 |         item,
65 |     ):
66 |         result = [item] * (len(lst) * 2 + 1)
67 |         result[1::2] = lst
68 |         return result
69 |
70 |     def _pad_tokens(
71 |         self,
72 |         tokens,
73 |     ):
74 |         tokens_lengths = [len(token) for token in tokens]
75 |         max_len = max(tokens_lengths)
76 |         tokens = [token + [0 for _ in range(max_len - len(token))] for token in tokens]
77 |         return tokens, tokens_lengths
--------------------------------------------------------------------------------
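To make the tokenizer's post-processing concrete, here is a small standalone sketch of what the `_intersperse` and `_pad_tokens` helpers above produce, using toy token IDs rather than a real vocabulary: the blank/pad token 0 is interleaved around every phoneme token, and each sequence is right-padded to the longest one in the batch.

```py
def intersperse(lst, item):
    # Place `item` before, between, and after every element of `lst`
    result = [item] * (len(lst) * 2 + 1)
    result[1::2] = lst
    return result

def pad_tokens(tokens):
    # Right-pad each sequence with 0 up to the longest sequence in the batch
    tokens_lengths = [len(token) for token in tokens]
    max_len = max(tokens_lengths)
    tokens = [token + [0] * (max_len - len(token)) for token in tokens]
    return tokens, tokens_lengths

print(intersperse([7, 8, 9], 0))        # [0, 7, 0, 8, 0, 9, 0]
print(pad_tokens([[7, 8], [7, 8, 9]]))  # ([[7, 8, 0], [7, 8, 9]], [2, 3])
```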
/requirements.txt:
--------------------------------------------------------------------------------
1 | numpy
2 | onnxruntime==1.7.0
3 | phonemizer==2.2.1
4 |
--------------------------------------------------------------------------------