├── desta ├── __init__.py └── modeling_desta.py ├── assets ├── audios │ ├── 7_1_d7.wav │ ├── 0_000307.wav │ ├── 4_0_d47.wav │ ├── p284_159.wav │ ├── p287_162.wav │ ├── MUL0608_120.98_148.92.wav │ ├── Ses04F_impro02_M004.wav │ ├── Ses01F_script03_1_F029.wav │ ├── Ses01M_script01_1_F014.wav │ ├── common_voice_en_34980360.mp3 │ ├── SpeechTextMatching_LJSpeech_0001_LJ001-0107.wav │ ├── LanguageIdentification_VoxForge_0000_de143-43.flac │ ├── DialogueEmotionClassification_DailyTalk_0196_7_1_d756.wav │ ├── AccentClassification_AccentdbExtended_0193_british_s01_176.wav │ ├── EmotionRecognition_MultimodalEmotionlinesDataset_0026_dia382_utt0.wav │ ├── SpeakerVerification_LibriSpeech-TestClean_0046_3575-170457-0038.flac │ └── NoiseDetection_LJSpeech_MUSAN-Music_0199_music_LJSpeech-1.1_16k_LJ050-0033.wav ├── images │ ├── method.png │ ├── model.png │ ├── figure1.png │ ├── model_training.png │ └── dataset_construction.png └── css │ └── styles.css ├── setup.py ├── LICENSE ├── demo.ipynb ├── .gitignore ├── app.py ├── README.md └── index.html /desta/__init__.py: -------------------------------------------------------------------------------- 1 | from .modeling_desta import Desta2Config, DestaModel -------------------------------------------------------------------------------- /assets/audios/7_1_d7.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/7_1_d7.wav -------------------------------------------------------------------------------- /assets/images/method.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/images/method.png -------------------------------------------------------------------------------- /assets/images/model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/images/model.png -------------------------------------------------------------------------------- /assets/audios/0_000307.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/0_000307.wav -------------------------------------------------------------------------------- /assets/audios/4_0_d47.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/4_0_d47.wav -------------------------------------------------------------------------------- /assets/audios/p284_159.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/p284_159.wav -------------------------------------------------------------------------------- /assets/audios/p287_162.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/p287_162.wav -------------------------------------------------------------------------------- /assets/images/figure1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/images/figure1.png -------------------------------------------------------------------------------- /assets/images/model_training.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/images/model_training.png -------------------------------------------------------------------------------- /assets/audios/MUL0608_120.98_148.92.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/MUL0608_120.98_148.92.wav -------------------------------------------------------------------------------- /assets/audios/Ses04F_impro02_M004.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/Ses04F_impro02_M004.wav -------------------------------------------------------------------------------- /assets/images/dataset_construction.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/images/dataset_construction.png -------------------------------------------------------------------------------- /assets/audios/Ses01F_script03_1_F029.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/Ses01F_script03_1_F029.wav -------------------------------------------------------------------------------- /assets/audios/Ses01M_script01_1_F014.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/Ses01M_script01_1_F014.wav -------------------------------------------------------------------------------- /assets/audios/common_voice_en_34980360.mp3: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/common_voice_en_34980360.mp3 -------------------------------------------------------------------------------- /assets/audios/SpeechTextMatching_LJSpeech_0001_LJ001-0107.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/SpeechTextMatching_LJSpeech_0001_LJ001-0107.wav -------------------------------------------------------------------------------- /assets/audios/LanguageIdentification_VoxForge_0000_de143-43.flac: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/LanguageIdentification_VoxForge_0000_de143-43.flac -------------------------------------------------------------------------------- /assets/audios/DialogueEmotionClassification_DailyTalk_0196_7_1_d756.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/DialogueEmotionClassification_DailyTalk_0196_7_1_d756.wav -------------------------------------------------------------------------------- /assets/audios/AccentClassification_AccentdbExtended_0193_british_s01_176.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/AccentClassification_AccentdbExtended_0193_british_s01_176.wav -------------------------------------------------------------------------------- /assets/audios/EmotionRecognition_MultimodalEmotionlinesDataset_0026_dia382_utt0.wav: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/EmotionRecognition_MultimodalEmotionlinesDataset_0026_dia382_utt0.wav -------------------------------------------------------------------------------- /assets/audios/SpeakerVerification_LibriSpeech-TestClean_0046_3575-170457-0038.flac: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/SpeakerVerification_LibriSpeech-TestClean_0046_3575-170457-0038.flac -------------------------------------------------------------------------------- /assets/audios/NoiseDetection_LJSpeech_MUSAN-Music_0199_music_LJSpeech-1.1_16k_LJ050-0033.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/NoiseDetection_LJSpeech_MUSAN-Music_0199_music_LJSpeech-1.1_16k_LJ050-0033.wav -------------------------------------------------------------------------------- /assets/css/styles.css: -------------------------------------------------------------------------------- 1 | td audio { 2 | width: 100%; 3 | max-width: 300px; 4 | } 5 | td{ 6 | min-width: 200px; 7 | } 8 | 9 | @media (max-width: 768px) { 10 | audio { 11 | max-width: 300px; 12 | } 13 | 14 | } 15 | 16 | @media (max-width: 480px) { 17 | audio { 18 | max-width: 200px; 19 | } 20 | } -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | setup( 4 | name='desta', 5 | version='1.0.0', 6 | packages=find_packages(), 7 | install_requires=[ 8 | "transformers", 9 | "librosa" 10 | ], 11 | author='Ke-Han Lu', 12 | description='An simple module for inferencing DeSTA model', 13 | long_description=open('README.md').read(), 14 | long_description_content_type='text/markdown', 15 | url='https://github.com/kehanlu/DeSTA2', 16 | ) -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | # License Notice 2 | 3 | This model is a derivative work that incorporates multiple components. 4 | To use this model, please refer to and comply with the following licenses: 5 | 6 | ## Foundation Models: 7 | 8 | - Whisper (MIT License) 9 | - https://github.com/openai/whisper/blob/main/LICENSE 10 | - Llama3-8b-Instruct 11 | - https://www.llama.com/llama3/license/ 12 | 13 | ## Datasets: 14 | 15 | - AccentDB (CC BY-NC 4.0) 16 | - https://accentdb.org/ 17 | - DailyTalk (CC BY-SA 4.0) 18 | - https://github.com/keonlee9420/DailyTalk 19 | - IEMOCAP 20 | - https://sail.usc.edu/iemocap/Data_Release_Form_IEMOCAP.pdf 21 | - VoxCeleb (CC BY 4.0) 22 | - https://mm.kaist.ac.kr/datasets/voxceleb/\#license 23 | - VCTK (CC BY 4.0) 24 | - https://huggingface.co/datasets/CSTR-Edinburgh/vctk 25 | - PromptTTS (Azure AI Services) 26 | - https://azure.microsoft.com/en-us/products/ai-services/ai-speech/#overview 27 | - Librispeech(CC BY 4.0) 28 | - [https://www.openslr.org/12](https://www.openslr.org/12) 29 | - LJSpeech 30 | - https://keithito.com/LJ-Speech-Dataset/ 31 | - Musan (CC BY 4.0) 32 | 33 | Please refer to each original license for complete terms and conditions. 
34 | -------------------------------------------------------------------------------- /demo.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "data": { 10 | "application/vnd.jupyter.widget-view+json": { 11 | "model_id": "be4c476868494dba9f3e4d5a4ddf74c1", 12 | "version_major": 2, 13 | "version_minor": 0 14 | }, 15 | "text/plain": [ 16 | "Loading checkpoint shards: 0%| | 0/4 [00:00 [!important] 2 | > 🚀🚀 Our latest model **DeSTA2.5-Audio** is out! ➡️➡️ [📑 Paper](https://arxiv.org/abs/2507.02768) | [👩‍💻 Github](https://github.com/kehanlu/DeSTA2.5-Audio) 3 | 4 | # DeSTA2 5 | 6 | [📑 Paper](https://arxiv.org/pdf/2409.20007) | [🌐 Website](https://kehanlu.github.io/DeSTA2/) | [👩‍💻 Github](https://github.com/kehanlu/DeSTA2) | [🤗 Model](https://huggingface.co/DeSTA-ntu/DeSTA2-8B-beta) | [🤗 Dataset](https://huggingface.co/datasets/DeSTA-ntu/DeSTA2-Llama3-8B-Instruct) 7 | 8 | **Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data** 9 | 10 | > Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee 11 | 12 | This paper introduces DeSTA2, a novel approach for developing instruction-following speech language models (SLMs) without relying on speech instruction-tuning data. By leveraging text-based large language models to generate speech-text pairs from rich metadata, DeSTA2 achieves impressive performance on speech understanding benchmarks while retaining advanced reasoning capabilities. 13 | 14 | ![model](assets/images/model.png) 15 | 16 | ### Demo 17 | 18 | https://github.com/user-attachments/assets/20adfd47-1259-4cdd-9602-306d0d29711b 19 | 20 | ## ✨ News 21 | - 2025/07/03: 🚀🚀 DeSTA2.5-Audio is out! Check it out: [Github: DeSTA2.5-Audio](https://github.com/kehanlu/DeSTA2.5-Audio) 22 | - 2025/01/06: Gradio demo for DeSTA2. Simply run: `gradio app.py`. 23 | - 2025/01/05: Training code for DeSTA2. See [this branch](https://github.com/kehanlu/Nemo/tree/icassp/examples/desta) 24 | - 2024/12/21: DeSTA2 is accepted to ICASSP 2025! 25 | - 2024/10/10: We release DeSTA2-8B-beta model. See [🤗 Model](https://huggingface.co/DeSTA-ntu/DeSTA2-8B-beta) for details. 26 | - 2024/10/09: We release the dataset(speech captions). See [🤗 Dataset](https://huggingface.co/datasets/DeSTA-ntu/DeSTA2-Llama3-8B-Instruct) for details. 27 | - 2024/10/01: DeSTA2 paper is now available on arXiv. [📑 Paper](https://arxiv.org/pdf/2409.20007) 28 | 29 | 30 | ## Usage 31 | 32 | ### Load Model 33 | 34 | **A. Using AutoModel from Huggingface transformers** 35 | 36 | ```python 37 | HF_TOKEN = "hf_..." # your huggingface token for downloading Llama3 from official Meta repo 38 | 39 | from transformers import AutoModel 40 | 41 | model = AutoModel.from_pretrained("DeSTA-ntu/DeSTA2-8B-beta", trust_remote_code=True, token=HF_TOKEN) 42 | model.to("cuda") 43 | ``` 44 | 45 | **B. Using DestaModel from this repo** 46 | 47 | ``` 48 | git clone git@github.com:kehanlu/DeSTA2.git 49 | cd DeSTA2 50 | pip install -e . 51 | ``` 52 | 53 | ```python 54 | HF_TOKEN = "hf_..." # your huggingface token for downloading Llama3 from official Meta repo 55 | 56 | from desta import DestaModel 57 | 58 | model = DestaModel.from_pretrained("DeSTA-ntu/DeSTA2-8B-beta", token=HF_TOKEN) 59 | model.to("cuda") 60 | ``` 61 | 62 | 63 | 64 | ### Have fun! 
65 | 66 | ```python 67 | messages = [ 68 | {"role": "system", "content": "You are a helpful voice assistant."}, 69 | {"role": "audio", "content": ""}, 70 | {"role": "user", "content": "Describe the audio."} 71 | ] 72 | 73 | generated_ids = model.chat( 74 | messages, 75 | max_new_tokens=128, 76 | do_sample=True, 77 | temperature=0.6, 78 | top_p=0.9 79 | ) 80 | 81 | response = model.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 82 | print(response) 83 | ``` 84 | 85 | 86 | ### Examples 87 | 88 | **Complex reasoning ability for arbitrary user input** 89 | ```python 90 | messages = [ 91 | {"role": "system", "content": "Focus on the input audio."}, 92 | {"role": "audio", "content": "assets/audios/DialogueEmotionClassification_DailyTalk_0196_7_1_d756.wav"}, 93 | {"role": "user", "content": "How many speakers are there in the audio? Can you tell me the emotion of each speaker? Can you also provide a brief summary of the audio? Is there any background noise? How do you feel today? Can you give me a response to the audio?"}, 94 | ] 95 | generated_ids = model.chat(messages, max_new_tokens=128, do_sample=True, temperature=0.6, top_p=0.9) 96 | response = model.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 97 | print(response) 98 | 99 | # == MODEL RESPONSE == 100 | # Based on the given audio snippet, here's my analysis: 101 | 102 | # 1. Number of speakers: 1 103 | # 2. Emotion of the speaker: Sadness 104 | # 3. Summary of the audio: The speaker is apologizing for causing trouble, indicating a sense of regret and remorse. 105 | # 4. Background noise: None mentioned, but it's possible that there might be some ambient noise that's not audible from the snippet. 106 | # 5. How I feel today: As an AI, I don't have emotions, but I'm designed to provide a neutral and objective analysis. 107 | # 6. Response to the audio: "It's okay, don't apologize." 108 | ``` 109 | 110 | **Math calculation and chain-of-thought reasoning based on the audio content** 111 | ```python 112 | messages = [ 113 | {"role": "system", "content": "Focus on the input audio."}, 114 | {"role": "audio", "content": "assets/audios/7_1_d7.wav"}, 115 | {"role": "user", "content": "How many speakers are there in the audio?\nCan you tell me the emotion of each speaker?(sad=1, angry=2, neutral=3, happy=4)?\nHow long is the audio in seconds? \n\nDo the math multiply the numbers from above 3 questions. Think step by step."}, 116 | ] 117 | 118 | generated_ids = model.chat(messages, max_new_tokens=128, do_sample=True, temperature=0.6, top_p=0.9) 119 | response = model.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 120 | print(response) 121 | 122 | # == MODEL RESPONSE == 123 | # Let's break it down step by step: 124 | # 1. How many speakers are there in the audio? 125 | # There is only 1 speaker. 126 | # 2. Can you tell me the emotion of each speaker? 127 | # The emotion is happiness, which corresponds to a value of 4. 128 | # 3. How long is the audio in seconds? 129 | # The audio is 2 seconds long. 130 | 131 | # Now, let's do the math: 132 | # 1 x 4 x 2 = 8 133 | # So, the answer is 8! 134 | ``` 135 | 136 | **Multilingual ability** 137 | > Note: the model only trained on English data, but it is generalizable to other languages thanks to the multilingual capability of Llama3! 
😊 138 | 139 | ```python 140 | messages = [ 141 | {"role": "system", "content": "請仔細聽輸入的音訊。"}, 142 | {"role": "audio", "content": "assets/audios/7_1_d7.wav"}, 143 | {"role": "user", "content": "請描述這段聲音,聲音聽起來如何?內容說了什麼?\n以中文回答。"}, 144 | ] 145 | 146 | generated_ids = model.chat(messages, max_new_tokens=128, do_sample=True, temperature=0.6, top_p=0.9) 147 | response = model.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 148 | print(response) 149 | 150 | # == MODEL RESPONSE == 151 | # 這段聲音是一個女性的聲音,聽起來是很愉快的。她說了一句「Thank you for calling. Goodbye」,語氣溫柔、柔和,沒有任何激動或激情的感覺。整個聲音聽起來很平靜、很自然,沒有任何明顯的感覺或情感色彩。 152 | ``` 153 | 154 | **Multi-turn question-answering** 155 | ```python 156 | messages = [ 157 | {"role": "system", "content": "Focus on the audio clip."}, 158 | {"role": "audio", "content": "assets/audios/7_1_d7.wav"}, 159 | {"role": "user", "content": "Transcribe the audio."}, 160 | ] 161 | 162 | generated_ids = model.chat(messages, max_new_tokens=256, do_sample=True, temperature=0.6, top_p=0.9) 163 | response = model.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 164 | print(response) 165 | 166 | # == MODEL RESPONSE == 167 | # Here is the transcription of the audio clip: 168 | # "Thank you for calling. Goodbye." 169 | 170 | messages.append({"role": "assistant", "content": response}) 171 | messages.append({"role": "user", "content": "Who is the speaker talking to?"}) 172 | generated_ids = model.chat(messages, max_new_tokens=256, do_sample=True, temperature=0.6, top_p=0.9) 173 | response = model.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 174 | print(response) 175 | 176 | # == MODEL RESPONSE == 177 | # The speaker is speaking to someone who has called them, but it's not specified who that person is. The conversation appears to be a brief and abrupt goodbye, suggesting that the speaker may not have had a conversation with the caller or may have been interrupted. 
178 | ``` 179 | 180 | ## Citation 181 | 182 | if you find our work useful, please consider citing the paper: 183 | 184 | ``` 185 | @article{lu2024developing, 186 | title={Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data}, 187 | author={Lu, Ke-Han and Chen, Zhehuai and Fu, Szu-Wei and Yang, Chao-Han Huck and Balam, Jagadeesh and Ginsburg, Boris and Wang, Yu-Chiang Frank and Lee, Hung-yi}, 188 | journal={arXiv preprint arXiv:2409.20007}, 189 | year={2024} 190 | } 191 | 192 | @inproceedings{lu24c_interspeech, 193 | title = {DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment}, 194 | author = {Ke-Han Lu and Zhehuai Chen and Szu-Wei Fu and He Huang and Boris Ginsburg and Yu-Chiang Frank Wang and Hung-yi Lee}, 195 | year = {2024}, 196 | booktitle = {Interspeech 2024}, 197 | pages = {4159--4163}, 198 | doi = {10.21437/Interspeech.2024-457}, 199 | issn = {2958-1796}, 200 | } 201 | ``` 202 | -------------------------------------------------------------------------------- /desta/modeling_desta.py: -------------------------------------------------------------------------------- 1 | from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, WhisperForConditionalGeneration, PretrainedConfig, PreTrainedModel, BertConfig, AutoProcessor 2 | from transformers.models.bert.modeling_bert import BertEncoder 3 | from torch import nn 4 | import torch 5 | import os 6 | import librosa 7 | 8 | class Desta2Config(PretrainedConfig): 9 | model_type = "DestaModel" 10 | 11 | def __init__( 12 | self, 13 | llama_model_id="meta-llama/Meta-Llama-3-8B-Instruct", 14 | whisper_model_id="openai/whisper-small", 15 | prompt_size=64, 16 | **kwargs 17 | ): 18 | super().__init__(**kwargs) 19 | self.llama_model_id = llama_model_id 20 | self.whisper_model_id = whisper_model_id 21 | self.prompt_size = prompt_size 22 | 23 | self.whisper_config = AutoConfig.from_pretrained(self.whisper_model_id) 24 | self.llama_config = AutoConfig.from_pretrained(self.llama_model_id) 25 | 26 | class QformerConnector(PreTrainedModel): 27 | def __init__(self, cfg): 28 | super().__init__(cfg) 29 | self.cfg = cfg 30 | 31 | 32 | if self.cfg.whisper_model_id == "openai/whisper-medium": 33 | self.target_layer_ids = [5, 11, 17, 23] 34 | elif self.cfg.whisper_model_id == "openai/whisper-small": 35 | self.target_layer_ids = [2, 5, 8, 11] 36 | elif self.cfg.whisper_model_id == "openai/whisper-tiny": 37 | self.target_layer_ids = [0,1,2,3] 38 | elif self.cfg.whisper_model_id == "openai/whisper-large-v3": 39 | self.target_layer_ids = [3, 7, 11, 15, 19, 23, 27, 31] 40 | else: 41 | raise NotImplementedError(f"model_id {self.cfg.whisper_model_id} not implemented") 42 | 43 | 44 | self.layer_prompts = nn.ParameterList([ 45 | nn.Parameter(torch.randn(1, self.cfg.prompt_size, self.cfg.whisper_config.d_model)) for _ in range(len(self.target_layer_ids))] 46 | ) 47 | 48 | 49 | # (prompt_size, target_layers) 50 | self.layer_weights = nn.Parameter(torch.zeros(self.cfg.prompt_size, len(self.target_layer_ids), dtype=torch.float)) 51 | 52 | qformer_config = BertConfig() 53 | qformer_config.num_hidden_layers = 2 54 | qformer_config.num_attention_heads = self.cfg.whisper_config.encoder_attention_heads 55 | qformer_config.hidden_size = self.cfg.whisper_config.d_model 56 | qformer_config.add_cross_attention = True 57 | qformer_config.is_decoder = True 58 | 59 | self.qformer = BertEncoder(qformer_config) 60 | self.proj = nn.Sequential( 61 | nn.LayerNorm(self.cfg.whisper_config.d_model), 62 | 
nn.Linear(self.cfg.whisper_config.d_model, self.cfg.llama_config.hidden_size) # project to llama hidden size 63 | ) 64 | 65 | def forward(self, encoder_hidden_states): 66 | layer_prompt_outputs = [] 67 | for idx, encoder_hidden_state in enumerate(encoder_hidden_states): 68 | if idx in self.target_layer_ids: 69 | layer_prompt = self.layer_prompts[self.target_layer_ids.index(idx)].expand(encoder_hidden_state.size(0), -1, -1) 70 | qformer_output = self.qformer( 71 | hidden_states=layer_prompt, 72 | encoder_hidden_states=encoder_hidden_state, 73 | ) 74 | layer_prompt_output = qformer_output.last_hidden_state 75 | layer_prompt_outputs.append(layer_prompt_output) 76 | 77 | layer_prompt_outputs = torch.stack(layer_prompt_outputs, dim=0) 78 | layer_prompt_outputs = layer_prompt_outputs.permute(1, 2, 0, 3) 79 | 80 | self.norm_weights = torch.nn.functional.softmax(self.layer_weights, dim=-1).unsqueeze(-1) 81 | 82 | output = (layer_prompt_outputs * self.norm_weights).sum(dim=2) # (b, prompt_size, d_model) 83 | 84 | output = self.proj(output) 85 | 86 | return output 87 | 88 | class SpeechPerception(PreTrainedModel): 89 | def __init__(self, cfg): 90 | super().__init__(cfg) 91 | self.cfg = cfg 92 | 93 | self.whisper = WhisperForConditionalGeneration.from_pretrained(cfg.whisper_model_id) 94 | self.processor = AutoProcessor.from_pretrained(cfg.whisper_model_id) 95 | 96 | self.connector = QformerConnector(cfg) 97 | 98 | def generate(self, input_features): 99 | input_features = input_features.to(self.whisper.device) 100 | 101 | outputs = self.whisper.generate(input_features=input_features, return_dict_in_generate=True, output_hidden_states=True) # here we use default generate config for whisper 102 | 103 | transcriptions = self.processor.batch_decode(outputs.sequences, skip_special_tokens=True)[0] 104 | speech_features = self.connector(outputs.encoder_hidden_states) 105 | 106 | return transcriptions, speech_features 107 | 108 | 109 | class DestaModel(PreTrainedModel): 110 | config_class = Desta2Config 111 | 112 | def __init__(self, config, **kwargs): 113 | super().__init__(config) 114 | 115 | self.speech_perception = SpeechPerception(config) 116 | self.llama = AutoModelForCausalLM.from_pretrained(config.llama_model_id, torch_dtype=torch.bfloat16, **kwargs) 117 | self.tokenizer = AutoTokenizer.from_pretrained(config.llama_model_id, **kwargs) 118 | 119 | 120 | def chat(self, messages, max_new_tokens=128, do_sample=True, temperature=0.6, top_p=0.9): 121 | """ 122 | messages: list of dicts with keys "role" and "content" 123 | ``` 124 | [ 125 | {"role": "system", "content": "You are a helpful voice assistant."}, 126 | {"role": "audio", "content": ""}, 127 | {"role": "user", "content": "Describe the audio."} 128 | ] 129 | ``` 130 | """ 131 | 132 | audio_path, input_features = self.load_audio(messages) 133 | transcription, audio_features = self.speech_perception.generate(input_features) 134 | inputs, audio_position = self.process_text(messages, audio_path, transcription) 135 | 136 | inputs_embeds, attention_mask = self.prepare_llm_input( 137 | input_ids=inputs.input_ids, 138 | attention_mask=inputs.attention_mask, 139 | audio_position=audio_position, 140 | audio_features=audio_features 141 | ) 142 | 143 | outputs = self.llama.generate( 144 | inputs_embeds=inputs_embeds, 145 | attention_mask=attention_mask, 146 | pad_token_id=self.tokenizer.eos_token_id, 147 | max_new_tokens=max_new_tokens, 148 | do_sample=do_sample, 149 | temperature=temperature, 150 | top_p=top_p, 151 | ) 152 | return outputs 153 | 154 | def 
process_text(self, messages, audio_path, transcription): 155 | context = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) 156 | left_text, right_text = context.split(audio_path) 157 | right_text = transcription + right_text # 158 | 159 | audio_position = len(self.tokenizer.tokenize(left_text)) 160 | context = left_text + right_text 161 | 162 | inputs = self.tokenizer(context, return_tensors="pt") 163 | 164 | return inputs, audio_position 165 | 166 | 167 | def prepare_llm_input(self, input_ids, attention_mask, audio_position, audio_features): 168 | input_ids = input_ids.to(self.llama.device) 169 | attention_mask = attention_mask.to(self.llama.device) 170 | audio_features = audio_features.to(self.llama.device) 171 | audio_feature_length = audio_features.size(1) 172 | 173 | inputs_embeds = self.llama.model.embed_tokens(input_ids) # [bs, seq_len, hidden_size] 174 | 175 | 176 | inputs_embeds = torch.cat([inputs_embeds[0, :audio_position], audio_features[0, :], inputs_embeds[0, audio_position:]], dim=0) 177 | attention_mask = torch.cat([attention_mask[0, :audio_position], torch.ones([ audio_feature_length], dtype=torch.long, device=self.llama.device), attention_mask[0, audio_position:]], dim=0) 178 | 179 | inputs_embeds = inputs_embeds.to(self.llama.dtype) 180 | attention_mask = attention_mask.to(self.llama.dtype) 181 | return inputs_embeds.unsqueeze(0), attention_mask.unsqueeze(0) 182 | 183 | 184 | def load_audio(self, messages): 185 | audio_path = None 186 | for message in messages: 187 | if message["role"] == "audio" and audio_path is not None: 188 | raise ValueError("Multiple audio file paths found in messages. We only support one audio file per message at this moment.") 189 | if message["role"] == "audio": 190 | audio_path = message["content"] 191 | if audio_path is None: 192 | raise ValueError("No audio file path found in messages") 193 | audio, ori_sr = librosa.load(audio_path) 194 | audio = librosa.resample(audio, orig_sr=ori_sr, target_sr=16000) 195 | input_features = self.speech_perception.processor(audio, sampling_rate=16000, return_tensors="pt").input_features 196 | 197 | return audio_path, input_features 198 | 199 | @classmethod 200 | def from_pretrained(cls, pretrained_model_name_or_path, *model_args, config=None,**kwargs): 201 | config = cls.config_class.from_pretrained(pretrained_model_name_or_path, **kwargs) 202 | model = cls(config, **kwargs) 203 | 204 | if os.path.isdir(pretrained_model_name_or_path): 205 | model.speech_perception.connector.load_state_dict( 206 | torch.load(os.path.join(pretrained_model_name_or_path, "qformer_connector.pth")) 207 | ) 208 | else: 209 | from huggingface_hub import hf_hub_download 210 | path = hf_hub_download(repo_id=pretrained_model_name_or_path, filename="qformer_connector.pth") 211 | model.speech_perception.connector.load_state_dict( 212 | torch.load(path) 213 | ) 214 | 215 | return model 216 | -------------------------------------------------------------------------------- /index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data 9 | 10 | 11 |
12 |
13 |
14 |

15 | DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data 16 |

17 |
18 |

19 | Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank 20 | Wang, Hung-yi Lee 21 |

22 |

National Taiwan University, NVIDIA

23 | 24 | 📑 Paper | 25 | 🌐 Website | 26 | 👩‍💻 Github | 27 | 🤗 Model | 28 | 🤗 Dataset 29 | 30 | 44 |
45 | 46 |
47 |
48 |
49 | 50 |
51 |
52 |
53 |
54 |

✨✨ Key features

55 | 56 |
    57 |
  • 58 | We propose an automatic approach to develop speech language models without using speech 59 | instruction-tuning data. We aim to inject speech understanding capability into text-based LLMs while 60 | keeping their original versatility. 61 |
  • 62 |
  • 63 | We train the end-to-end speech language model with carefully constructed 64 | speech captions to avoid the catastrophic forgetting problem. 65 |
  • 66 |
  • 67 | The model shows strong instruction-following ability and 68 | demonstrates robust speech understanding capability on the 69 | Dynamic-SUPERB and 70 | AIR-Bench-Chat benchmarks. 71 |
  • 72 |
73 |
74 |
75 |
76 |
77 | 78 |
79 |
80 |
81 |
82 |

Instruction-following Ability

83 |

84 | Our model not only demonstrates strong speech understanding capabilities but also excels at following 85 | complex instructions, such as adhering to specific output formats and performing chain-of-thought reasoning. 86 |
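For reference, a minimal sketch of how such a format-constrained instruction can be issued through the released Python API (this assumes the `desta` package from this repository is installed, a GPU is available, and the bundled sample audio path exists; it mirrors the README usage rather than adding new functionality):

```python
from desta import DestaModel

# Load the released checkpoint (an HF token with Llama3 access is required).
model = DestaModel.from_pretrained("DeSTA-ntu/DeSTA2-8B-beta", token="hf_...")
model.to("cuda")

messages = [
    {"role": "system", "content": "Focus on the input audio."},
    {"role": "audio", "content": "assets/audios/7_1_d7.wav"},
    # The output-format constraint is simply part of the user turn.
    {"role": "user", "content": "What is the emotion of the speaker? Answer in JSON format."},
]

generated_ids = model.chat(messages, max_new_tokens=64, do_sample=True, temperature=0.6, top_p=0.9)
print(model.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```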

87 |
88 | 89 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 105 | 106 | 107 | 108 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 128 | 129 | 130 | 131 | 139 | 140 | 141 | 142 | 149 | 150 | 151 | 152 | 157 | 170 | 171 | 172 | 178 | 182 | 183 | 184 | 185 | 192 | 202 | 203 | 204 |
Instruction | Model Output
102 | What is the gender of the speaker? 103 | Your entire response should be in English, capital letters only. 104 | THE GENDER OF THE SPEAKER IS FEMALE.
109 | What is the emotion of the speaker? 110 | Your entire response should be in English, capital letters only. 111 | HAPPINESS
What is the gender of the speaker? Answer in JSON format. | {"gender": "Female" }
What is the emotion conveyed in the speech? Think step by step. 121 | Let's break it down step by step!
1. We start with the phrase "Thank you for calling". This 122 | implies that a request has been made, and the speaker is expressing gratitude.
2. The next phrase 123 | is "Goodbye". This is a normal way to end a conversation, and it's a polite way to part ways.
Considering 124 | these two phrases together, we can infer that the speaker is happy to end the conversation and is not 125 | expecting anything further. The tone is friendly and grateful, suggesting that the speaker has had a 126 | positive interaction.
Therefore, the emotion conveyed in the speech is happiness! 127 |
Analyze the speech, then tell me what is the emotion conveyed in the speech? 132 | Let's break down the speech:
* "Thank you for calling": This phrase is a polite expression of 133 | gratitude for the person's call.
* "Goodbye": This is a final farewell, indicating that the 134 | conversation is ending.
The tone of the speech is polite and friendly, as the speaker is thanking 135 | the person for calling and bidding them farewell.
The emotion conveyed in this speech is 136 | happiness. The speaker seems cheerful and appreciative, expressing gratitude for the call and ending 137 | the conversation on a positive note. 138 |
Calculate the words per second rate, think step by step. 143 | Let's break it down step by step:
1. The speech lasts for 2 seconds: [00:00:00 - 00:00:02]
2. 144 | The number of words is: "Thank you for calling. Goodbye." (5 words)
3. To find the words per 145 | second rate, we need to divide the number of words by the time it took to say the words:
Words 146 | per second = Total number of words / Time taken (in seconds)
= 5 words / 2 seconds
= 2.5 147 | words per second
Therefore, the words per second rate is 2.5 words 148 |
153 | How many speakers are there in the audio? Can you tell me the emotion of each speaker? Can you also 154 | provide a brief summary of the audio? Is there any background noise? How do you feel today? Can you 155 | give me a response to the audio? 156 | 158 | There is only 1 speaker in the audio.
159 | 160 | The speaker is a female, and her emotion is happiness.
161 | 162 | The audio is a brief goodbye message, where the speaker says "Thank you for calling. Goodbye."
163 | 164 | There is no background noise in the audio.
165 | 166 | As a machine, I don't have emotions, so I don't feel anything.
167 | 168 | As for a response to the audio, I can say: "You're welcome, and have a great day!" 169 |
173 | 請描述這段聲音,聲音聽起來如何?內容說了什麼? 以中文回答。

Note: our model only trained on English data, but it is generalizable to other languages thanks to 175 | the multilingual capability of Llama3! 😊 176 | 177 |
179 | 這段聲音是一個女性的聲音,聽起來是很愉快的。她說了一句「Thank you for calling. 180 | Goodbye」,語氣溫柔、柔和,沒有任何激動或激情的感覺。整個聲音聽起來很平靜、很自然,沒有任何明顯的感覺或情感色彩。 181 |
186 | Multi-turn question-answering
187 | First question
188 | Transcribe the audio.

189 | Second question
190 | Who is the speaker talking to? 191 |
193 | First response
194 | Here is the transcription of the audio clip:
195 | "Thank you for calling. Goodbye."
196 |
197 | Second response
198 | The speaker is speaking to someone who has called them, but it's not specified who that person is. The 199 | conversation appears to be a brief and abrupt goodbye, suggesting that the speaker may not have had a 200 | conversation with the caller or may have been interrupted. 201 |
205 |
206 |
207 |
208 | 209 |
210 |
211 |
212 |

Speech Understanding Ability

213 |
214 |

Dynamic-SUPERB

215 |

216 | Dynamic-SUPERB is a crowdsourced benchmark 217 | comprising 48 speech-related classification tasks with human-created instructions and answers. 218 |

219 |
220 | 221 |
222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | 239 | 243 | 248 | 249 | 250 | 251 | 260 | 264 | 269 | 270 | 271 | 272 | 281 | 285 | 288 | 289 | 290 | 291 | 300 | 304 | 309 | 310 | 311 | 312 | 321 | 325 | 329 | 330 | 331 | 332 | 338 | 342 | 347 | 348 | 349 | 350 | 358 | 362 | 369 | 370 | 371 | 372 |
Audio | Instruction | Model Output | Ground Truth
234 | 237 |

LanguageIdentification

238 |
240 | Which language can you recognize from this audio recording? The answer could be German, English, 241 | Spanish, Italian, Russian, or French. 242 | 244 | Based on the audio recording, I recognize the language as German. The phrase "Die hätte man dicht 245 | machen müssen" is a grammatically correct and idiomatic German sentence, and the speaker's accent 246 | and pronunciation also suggest a German speaker. 247 | German
252 | 258 |

NoiseDetection

259 |
261 | Determine whether the sentence is a noise-free utterance or a noisy one with music disturbance. The 262 | answer could be clean or noisy. 263 | 265 | Based on the provided information, I would classify the sentence as "noisy". The noise level is 266 | moderate (15db), and there is a reverberation time of 60ms, which suggests that there is some 267 | background noise or music disturbance present. 268 | Noisy
273 | 279 |

AccentClassification

280 |
282 | The task involves detecting variations in pronunciation and intonation. The answer could be 283 | american, australian, bangla, british, indian, malayalam, odiya, telugu, or welsh. 284 | 286 | Based on the accent and pronunciation, I would identify the speaker as having a British accent. 287 | British
292 | 298 |

EmotionRecognition

299 |
301 | Recognize and categorize the emotions expressed in spoken language. The answer could be anger, 302 | disgust, sadness, joy, neutral, surprise, or fear. 303 | 305 | Based on the audio clip, the emotion expressed is "joy". The speaker's tone and words ("Here she is, 306 | future fashion superstar") convey a sense of excitement and happiness, indicating a positive 307 | emotional state. 308 | Joy
313 | 319 |

DialogueEmotionClassification

320 |
322 | Analyze the dialogue and identify its corresponding emotion. The answer could be anger, disgust, 323 | fear, sadness, happiness, surprise, or no emotion. 324 | 326 | Based on the dialogue, the corresponding emotion is sadness. The speaker is apologizing and 327 | expressing regret for causing trouble, which is a characteristic of a sad tone. 328 | Sadness
333 | 336 |

SpeechTextMatching

337 |
339 | Verify if the speech and text are congruent. The answer could be yes or no.
Text: "guidelines 340 | for an experimental program to develop more detailed criteria" 341 |
343 | The answer is NO.
The speech is about saying a few words on the principles of design and 344 | typography, but the text is about guidelines for an experimental program to develop more detailed 345 | criteria, which is a different topic. 346 |
No
351 | 356 |

SpeakerVerification

357 |
359 | Scrutinize the first audio clip to establish whether the second one is delivered by the same 360 | speaker. The answer could be yes or no. 361 | 363 | Based on the provided audio clip, I would say NO, the second audio clip is not delivered by the same 364 | speaker. The tone, pitch, and emotion expressed in the first audio clip (neutral, calm, and 365 | introspective) are different from what I would expect from the same speaker. The first audio clip 366 | has a more contemplative and reflective tone, whereas the second audio clip is more dramatic and 367 | emotive. 368 | No
373 |
374 | 375 |
376 |

AIR-Bench-Chat

377 |

378 | AIR-Bench-Chat is an LLM-generated benchmark derived from 379 | speech metadata that focuses on open-ended questions. 380 |

381 |
382 |
383 | 384 | 385 | 386 | 387 | 388 | 389 | 390 | 391 | 392 | 393 | 394 | 399 | 402 | 408 | 412 | 413 | 414 | 419 | 423 | 428 | 429 | 430 | 431 | 436 | 437 | 446 | 447 | 448 | 449 | 454 | 455 | 456 | 457 | 458 | 459 |
Audio | Instruction | Model Output | Ground Truth
395 | 398 | 400 | Based on the speaker's emotional state, what could be the nature of the proof being requested? 401 | 403 | Based on the speaker's angry tone and high-pitched voice, it is likely that the proof being 404 | requested is related to a dispute or controversy. The speaker is demanding evidence to support a 405 | claim or argument, and is likely feeling frustrated or upset that the other party has not provided 406 | sufficient evidence to back up their claims. 407 | 409 | Given the speaker's anger, the proof likely relates to a serious allegation or an issue that 410 | significantly impacts the speaker.. 411 |
415 | 418 | 420 | What emotion is the speaker likely trying to convey to the listener by stating a fear of their 421 | departure? 422 | 424 | Based on the given information, the speaker is likely trying to convey a sense of sadness to the 425 | listener by stating a fear of their departure. The low pitch, slow speaking speed, and very low 426 | volume suggest a somber and melancholic tone, which is consistent with a sad emotion. 427 | The speaker is trying to convey a sense of sadness and attachment to the listener.
432 | 435 | What could be the next step for the manager after witnessing such a scene? 438 | Based on the manager's tone and emotion, it seems that she is amused and entertained by the scene. 439 | Given her high pitch and volume, she might be trying to emphasize the absurdity of the situation. 440 | Considering her slow speaking speed, she might be savoring the moment and enjoying the spectacle.

As 441 | a next step, the manager could:
* Laugh and join in on the fun, possibly even taking a photo or 442 | video to capture the moment.
* Try to intervene and calm the situation down, but in a playful 443 | and lighthearted manner, acknowledging the absurdity of the situation.
* Use the opportunity to 444 | make a humorous 445 |
The manager might need to intervene or address the situation occurring on the floor.
450 | 453 | What did the second person inquire about after the first person's request? | The second person inquired about the price range of the restaurant. | The second person inquired about any specifications on the price range.
460 |
461 |
462 |
463 |
464 | 465 |
466 |
467 |
468 |

Method

469 |
470 |

Stage 1: Dataset Construction

471 | 472 |

473 | In our dataset construction process, we first extract and format all textual attributes from the speech 474 | into a structured "seed transcript." This seed transcript is then combined with the prompt "What can you 475 | hear from the audio?" and fed into Llama3. The response generated by Llama3 becomes our training target 476 | for the end-to-end speech language model. 477 |

478 |

479 | There are 12 attributes: gender, age, accent, emotion, pitch, volume, speaking speed, SNR level, 480 | C50 value, duration, intent, and spoken text. 481 |
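To make this concrete, here is an illustrative sketch of how a seed transcript and the Llama3 prompt could be assembled from such metadata (the `build_seed_transcript` helper and the `segment` dictionary are hypothetical stand-ins, not the authors' actual data pipeline):

```python
def build_seed_transcript(segment):
    # Format the available metadata for one utterance into the structured
    # "seed transcript" text, e.g.
    # [00:00:01 - 00:00:02]: "These days are long gone." (Gender: Male, Accent: Scottish, ...)
    attrs = ", ".join(f"{k}: {v}" for k, v in segment["attributes"].items())
    return f'[{segment["start"]} - {segment["end"]}]: "{segment["text"]}" ({attrs})'

segment = {
    "start": "00:00:01",
    "end": "00:00:02",
    "text": "These days are long gone.",
    "attributes": {"Gender": "Male", "Accent": "Scottish", "Age": 20,
                   "Emotion": "neutral", "Duration": "2s"},
}

prompt = build_seed_transcript(segment) + "\nWhat can you hear from the audio?"
# `prompt` is then sent to a text-only LLM (Llama3-8B-Instruct); its response
# becomes the speech caption used as the training target.
```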

482 | 483 | 484 |
485 | 486 | 487 | 488 | 489 | 490 | 491 | 492 | 493 | 494 | 495 | 499 | 504 | 510 | 511 | 512 | 516 | 521 | 527 | 528 | 529 | 533 | 537 | 543 | 544 | 545 | 551 | 559 | 565 | 566 | 567 |
Seed Transcript | Llama3-Instruct Response | Corresponding Audio
496 | [00:00:01 - 00:00:02]: "These days are long gone." (Gender:Male, Accent: Scottish, Age: 20, 497 | Emotion: neutral, Duration: 2s) 498 | 500 | From the audio, I can hear:
* A male voice with a Scottish accent
* The voice is 501 | approximately 20 years old
* The tone is neutral, indicating a lack of strong emotions
* 502 | The audio duration is 2 seconds 503 |
505 | 508 |

VCTK-Corpus

509 |
513 | [00:00:01 - 00:00:04]: "At the top level, the salaries are far too high." (Gender:Male, Accent: 514 | English, Age: 23, Emotion: neutral, Duration: 4s) 515 | 517 | From the audio, we can hear a male speaker with an English accent (possibly British English) 518 | saying:
"At the top level, the salaries are far too high."
The speaker is a 23-year-old 519 | male with a neutral emotion. The audio lasts for 4 seconds. 520 |
522 | 525 |

VCTK-Corpus

526 |
530 | [00:00:01 - 00:00:04]: "Oh, lovely. Where's the wedding going to be?" (Gender:Male, Emotion: 531 | happiness, Action: question, Duration: 4s) 532 | 534 | From the audio, I can hear a male speaker asking a question with a tone of happiness, specifically 535 | the words: "Oh, lovely. Where's the wedding going to be?" 536 | 538 | 541 |

Dailytalk

542 |
546 | [00:00:01 - 00:00:07]: "A new pride taught me mine ego, and that teach I unto men: no longer to 547 | thrust one's head into the sand of celestial things, but to carry it freely, a terrestrial head, 548 | which giveth meaning to the earth!" (Gender:Male, Emotion: cheerful, Pitch: Normal, Volume: 549 | Normal, Speaking speed: Fast, Duration: 7s) 550 | 552 | From the given audio, we can hear a male speaker with a cheerful tone, speaking at a normal pitch 553 | and volume. The speaking speed is relatively fast, and the duration is 7 seconds.
The audio 554 | appears to be a philosophical or inspirational quote, with the speaker using poetic language to 555 | convey a message. The text itself is a condensed and somewhat abstract statement, but it seems to 556 | be encouraging people to be more grounded and down-to-earth, rather than getting lost in lofty or 557 | abstract concepts. 558 |
560 | 563 |

PromptTTS

564 |
568 |
569 |
570 |
571 |
572 |

Stage 2: Model Training

573 | 574 |

575 | During the model training phase, we utilize a pre-trained Whisper model to extract speech features and 576 | generate text transcriptions from the audio input. These speech features and the text embeddings (from 577 | the transcription and chat template) are concatenated before being fed into Llama3. We maintain a 578 | consistent template structure to prevent any mismatch. Finally, the model is trained to generate speech 579 | descriptions, effectively learning to interpret and articulate the nuances of spoken content based on 580 | both acoustic features and textual information. 581 |
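The released inference code follows the same recipe; the snippet below is a simplified paraphrase of `DestaModel.prepare_llm_input` in `desta/modeling_desta.py`, showing how the projected speech features are spliced into the text embeddings at the audio placeholder position before Llama3 generates (batch size 1, as in the released code):

```python
import torch

def splice_audio_features(text_embeds, attention_mask, audio_features, audio_position):
    # text_embeds:    (1, seq_len, hidden)     embeddings of the chat-template text
    # audio_features: (1, prompt_size, hidden) projected Q-Former outputs
    # audio_position: token index where the audio placeholder sits in the template
    inputs_embeds = torch.cat(
        [text_embeds[0, :audio_position], audio_features[0], text_embeds[0, audio_position:]],
        dim=0,
    )
    mask = torch.cat(
        [attention_mask[0, :audio_position],
         torch.ones(audio_features.size(1), dtype=attention_mask.dtype,
                    device=attention_mask.device),
         attention_mask[0, audio_position:]],
        dim=0,
    )
    # The result is fed to Llama3 via `inputs_embeds` instead of `input_ids`.
    return inputs_embeds.unsqueeze(0), mask.unsqueeze(0)
```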

582 | 583 | 584 |
585 |
586 |
587 |
588 |
589 | 590 | 591 | --------------------------------------------------------------------------------