[GitHub stars](https://github.com/Camb-ai/MARS5-TTS/stargazers)
[Join our Discord](https://discord.gg/FFQNCSKSXX)
[MARS5 on Hugging Face](https://huggingface.co/CAMB-AI/MARS5-TTS)
[Quickstart in Colab](https://colab.research.google.com/github/Camb-ai/mars5-tts/blob/master/mars5_demo.ipynb)

# Updates

**July 5, 2024**: Latest AR checkpoint released, with higher stability of output. A very big update is coming soon!

# Approach

This is the repo for the MARS5 English speech model (TTS) from CAMB.AI.

The model follows a two-stage AR-NAR pipeline with a distinctively novel NAR component (see more info in the [Architecture](docs/architecture.md)).

With just 5 seconds of audio and a snippet of text, MARS5 can generate speech even for prosodically hard and diverse scenarios like sports commentary, anime, and more. Check out our demo:

https://github.com/Camb-ai/MARS5-TTS/assets/23717819/3e191508-e03c-4ff9-9b02-d73ae0ebefdd

Watch the full video here: [MARS5 demo on YouTube](https://www.youtube.com/watch?v=bmJSLPYrKtE)

**Figure**: The high-level architecture flow of MARS5. Given text and a reference audio, coarse (L0) encodec speech features are obtained through an autoregressive transformer model. Then, the text, reference, and coarse features are refined in a multinomial DDPM model to produce the remaining encodec codebook values. The output of the DDPM is then vocoded to produce the final audio.

Because the model is trained on raw audio together with byte-pair-encoded text, it can be steered with things like punctuation and capitalization.
For example, to add a pause, add a comma at that point in the transcript; to emphasize a word, put it in capital letters in the transcript.
This enables a fairly natural way of guiding the prosody of the generated output.
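
As a minimal illustration of this kind of steering (the transcripts below are made-up examples of text that would be passed to MARS5 for synthesis):

```python
# Hypothetical transcripts showing prosody steering through the text itself.

# Plain version: pacing and emphasis are left entirely up to the model.
text_plain = "Hi there I'm your new assistant"

# Steered version: the comma encourages a short pause after "Hi there",
# and the capitalized word receives extra emphasis.
text_steered = "Hi there, I'm your NEW assistant"
```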

Speaker identity is specified using an audio reference file of between 2 and 12 seconds, with lengths of around 6 seconds giving optimal results.
Further, by providing the transcript of the reference, MARS5 can perform a '_deep clone_', which improves the quality of the cloning and of the output, at the cost of taking a bit longer to produce the audio. A sketch of the difference is shown below.
For more details on this and other performance and model details, please see the [docs folder](docs/architecture.md).
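
As a rough sketch of the shallow vs. deep clone difference (a sketch only: it assumes `mars5`, `config_class`, the reference `wav`, and its transcript have been loaded as in the Quickstart below, and the `deep_clone` flag and the exact `mars5.tts(...)` signature shown here are assumptions -- consult the Colab notebook for the canonical call):

```python
# ASSUMPTION: `mars5`, `config_class`, `wav` (reference audio) and
# `ref_transcript` (its transcript) are loaded as in the Quickstart below.
# The `deep_clone` flag and the `tts(...)` argument order are illustrative only.

# Shallow clone: only the reference audio is used. Faster, but lower cloning quality.
shallow_cfg = config_class(deep_clone=False)
ar_codes, wav_shallow = mars5.tts("Hello from MARS5.", wav, '', cfg=shallow_cfg)

# Deep clone: the transcript of the reference is supplied as well.
# Slower, but yields better cloning and output quality.
deep_cfg = config_class(deep_clone=True)
ar_codes, wav_deep = mars5.tts("Hello from MARS5.", wav, ref_transcript, cfg=deep_cfg)
```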

## Quick links

- [CAMB.AI website](https://camb.ai/) (access MARS in 140+ languages for TTS and dubbing)
- Technical details and architecture: [in the docs folder](docs/architecture.md)
- Colab quickstart: [mars5_demo.ipynb](https://colab.research.google.com/github/Camb-ai/mars5-tts/blob/master/mars5_demo.ipynb)
- Sample page with a few hard prosodic samples: [https://camb-ai.github.io/MARS5-TTS/](https://camb-ai.github.io/MARS5-TTS/)
- Online demo: [here](https://6b1a3a8e53ae.ngrok.app/)

## Quickstart

We use `torch.hub` to make loading the model easy -- no cloning of the repo needed. The steps to perform inference are simple:

1. **Installation using pip**:

Requirements:
- Python >= 3.10
- Torch >= 2.0
- Torchaudio
- Librosa
- Vocos
- Encodec
- safetensors
- regex

```bash
pip install --upgrade torch torchaudio librosa vocos encodec safetensors regex
```

2. **Load models**: load the MARS5 AR and NAR models from torch hub:

```python
import torch, librosa

mars5, config_class = torch.hub.load('Camb-ai/mars5-tts', 'mars5_english', trust_repo=True)
# `mars5` contains the AR and NAR models, as well as the inference code.
# `config_class` contains tunable inference config settings like temperature.
```

(Optional) Load the model from Hugging Face (make sure the repository is cloned so that `inference.py` is importable):
```python
from inference import Mars5TTS, InferenceConfig as config_class
import torch, librosa

mars5 = Mars5TTS.from_pretrained("CAMB-AI/MARS5-TTS")
```

3. **Pick a reference** and optionally its transcript:

```python
# Load reference audio between 1-12 seconds.
wav, sr = librosa.load('