├── desta ├── __init__.py └── modeling_desta.py ├── assets ├── audios │ ├── 7_1_d7.wav │ ├── 0_000307.wav │ ├── 4_0_d47.wav │ ├── p284_159.wav │ ├── p287_162.wav │ ├── MUL0608_120.98_148.92.wav │ ├── Ses04F_impro02_M004.wav │ ├── Ses01F_script03_1_F029.wav │ ├── Ses01M_script01_1_F014.wav │ ├── common_voice_en_34980360.mp3 │ ├── SpeechTextMatching_LJSpeech_0001_LJ001-0107.wav │ ├── LanguageIdentification_VoxForge_0000_de143-43.flac │ ├── DialogueEmotionClassification_DailyTalk_0196_7_1_d756.wav │ ├── AccentClassification_AccentdbExtended_0193_british_s01_176.wav │ ├── EmotionRecognition_MultimodalEmotionlinesDataset_0026_dia382_utt0.wav │ ├── SpeakerVerification_LibriSpeech-TestClean_0046_3575-170457-0038.flac │ └── NoiseDetection_LJSpeech_MUSAN-Music_0199_music_LJSpeech-1.1_16k_LJ050-0033.wav ├── images │ ├── method.png │ ├── model.png │ ├── figure1.png │ ├── model_training.png │ └── dataset_construction.png └── css │ └── styles.css ├── setup.py ├── LICENSE ├── demo.ipynb ├── .gitignore ├── app.py ├── README.md └── index.html /desta/__init__.py: -------------------------------------------------------------------------------- 1 | from .modeling_desta import Desta2Config, DestaModel -------------------------------------------------------------------------------- /assets/audios/7_1_d7.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/7_1_d7.wav -------------------------------------------------------------------------------- /assets/images/method.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/images/method.png -------------------------------------------------------------------------------- /assets/images/model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/images/model.png -------------------------------------------------------------------------------- /assets/audios/0_000307.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/0_000307.wav -------------------------------------------------------------------------------- /assets/audios/4_0_d47.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/4_0_d47.wav -------------------------------------------------------------------------------- /assets/audios/p284_159.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/p284_159.wav -------------------------------------------------------------------------------- /assets/audios/p287_162.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/p287_162.wav -------------------------------------------------------------------------------- /assets/images/figure1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/images/figure1.png -------------------------------------------------------------------------------- /assets/images/model_training.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/images/model_training.png -------------------------------------------------------------------------------- /assets/audios/MUL0608_120.98_148.92.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/MUL0608_120.98_148.92.wav -------------------------------------------------------------------------------- /assets/audios/Ses04F_impro02_M004.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/Ses04F_impro02_M004.wav -------------------------------------------------------------------------------- /assets/images/dataset_construction.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/images/dataset_construction.png -------------------------------------------------------------------------------- /assets/audios/Ses01F_script03_1_F029.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/Ses01F_script03_1_F029.wav -------------------------------------------------------------------------------- /assets/audios/Ses01M_script01_1_F014.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/Ses01M_script01_1_F014.wav -------------------------------------------------------------------------------- /assets/audios/common_voice_en_34980360.mp3: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/common_voice_en_34980360.mp3 -------------------------------------------------------------------------------- /assets/audios/SpeechTextMatching_LJSpeech_0001_LJ001-0107.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/SpeechTextMatching_LJSpeech_0001_LJ001-0107.wav -------------------------------------------------------------------------------- /assets/audios/LanguageIdentification_VoxForge_0000_de143-43.flac: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/LanguageIdentification_VoxForge_0000_de143-43.flac -------------------------------------------------------------------------------- /assets/audios/DialogueEmotionClassification_DailyTalk_0196_7_1_d756.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/DialogueEmotionClassification_DailyTalk_0196_7_1_d756.wav -------------------------------------------------------------------------------- /assets/audios/AccentClassification_AccentdbExtended_0193_british_s01_176.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/AccentClassification_AccentdbExtended_0193_british_s01_176.wav -------------------------------------------------------------------------------- /assets/audios/EmotionRecognition_MultimodalEmotionlinesDataset_0026_dia382_utt0.wav: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/EmotionRecognition_MultimodalEmotionlinesDataset_0026_dia382_utt0.wav -------------------------------------------------------------------------------- /assets/audios/SpeakerVerification_LibriSpeech-TestClean_0046_3575-170457-0038.flac: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/SpeakerVerification_LibriSpeech-TestClean_0046_3575-170457-0038.flac -------------------------------------------------------------------------------- /assets/audios/NoiseDetection_LJSpeech_MUSAN-Music_0199_music_LJSpeech-1.1_16k_LJ050-0033.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kehanlu/DeSTA2/HEAD/assets/audios/NoiseDetection_LJSpeech_MUSAN-Music_0199_music_LJSpeech-1.1_16k_LJ050-0033.wav -------------------------------------------------------------------------------- /assets/css/styles.css: -------------------------------------------------------------------------------- 1 | td audio { 2 | width: 100%; 3 | max-width: 300px; 4 | } 5 | td{ 6 | min-width: 200px; 7 | } 8 | 9 | @media (max-width: 768px) { 10 | audio { 11 | max-width: 300px; 12 | } 13 | 14 | } 15 | 16 | @media (max-width: 480px) { 17 | audio { 18 | max-width: 200px; 19 | } 20 | } -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | setup( 4 | name='desta', 5 | version='1.0.0', 6 | packages=find_packages(), 7 | install_requires=[ 8 | "transformers", 9 | "librosa" 10 | ], 11 | author='Ke-Han Lu', 12 | description='An simple module for inferencing DeSTA model', 13 | long_description=open('README.md').read(), 14 | long_description_content_type='text/markdown', 15 | url='https://github.com/kehanlu/DeSTA2', 16 | ) -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | # License Notice 2 | 3 | This model is a derivative work that incorporates multiple components. 4 | To use this model, please refer to and comply with the following licenses: 5 | 6 | ## Foundation Models: 7 | 8 | - Whisper (MIT License) 9 | - https://github.com/openai/whisper/blob/main/LICENSE 10 | - Llama3-8b-Instruct 11 | - https://www.llama.com/llama3/license/ 12 | 13 | ## Datasets: 14 | 15 | - AccentDB (CC BY-NC 4.0) 16 | - https://accentdb.org/ 17 | - DailyTalk (CC BY-SA 4.0) 18 | - https://github.com/keonlee9420/DailyTalk 19 | - IEMOCAP 20 | - https://sail.usc.edu/iemocap/Data_Release_Form_IEMOCAP.pdf 21 | - VoxCeleb (CC BY 4.0) 22 | - https://mm.kaist.ac.kr/datasets/voxceleb/\#license 23 | - VCTK (CC BY 4.0) 24 | - https://huggingface.co/datasets/CSTR-Edinburgh/vctk 25 | - PromptTTS (Azure AI Services) 26 | - https://azure.microsoft.com/en-us/products/ai-services/ai-speech/#overview 27 | - Librispeech(CC BY 4.0) 28 | - [https://www.openslr.org/12](https://www.openslr.org/12) 29 | - LJSpeech 30 | - https://keithito.com/LJ-Speech-Dataset/ 31 | - Musan (CC BY 4.0) 32 | 33 | Please refer to each original license for complete terms and conditions. 
34 | -------------------------------------------------------------------------------- /demo.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "data": { 10 | "application/vnd.jupyter.widget-view+json": { 11 | "model_id": "be4c476868494dba9f3e4d5a4ddf74c1", 12 | "version_major": 2, 13 | "version_minor": 0 14 | }, 15 | "text/plain": [ 16 | "Loading checkpoint shards: 0%| | 0/4 [00:00 [!important] 2 | > 🚀🚀 Our latest model **DeSTA2.5-Audio** is out! ➡️➡️ [📑 Paper](https://arxiv.org/abs/2507.02768) | [👩‍💻 Github](https://github.com/kehanlu/DeSTA2.5-Audio) 3 | 4 | # DeSTA2 5 | 6 | [📑 Paper](https://arxiv.org/pdf/2409.20007) | [🌐 Website](https://kehanlu.github.io/DeSTA2/) | [👩‍💻 Github](https://github.com/kehanlu/DeSTA2) | [🤗 Model](https://huggingface.co/DeSTA-ntu/DeSTA2-8B-beta) | [🤗 Dataset](https://huggingface.co/datasets/DeSTA-ntu/DeSTA2-Llama3-8B-Instruct) 7 | 8 | **Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data** 9 | 10 | > Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee 11 | 12 | This paper introduces DeSTA2, a novel approach for developing instruction-following speech language models (SLMs) without relying on speech instruction-tuning data. By leveraging text-based large language models to generate speech-text pairs from rich metadata, DeSTA2 achieves impressive performance on speech understanding benchmarks while retaining advanced reasoning capabilities. 13 | 14 | ![model](assets/images/model.png) 15 | 16 | ### Demo 17 | 18 | https://github.com/user-attachments/assets/20adfd47-1259-4cdd-9602-306d0d29711b 19 | 20 | ## ✨ News 21 | - 2025/07/03: 🚀🚀 DeSTA2.5-Audio is out! Check it out: [Github: DeSTA2.5-Audio](https://github.com/kehanlu/DeSTA2.5-Audio) 22 | - 2025/01/06: Gradio demo for DeSTA2. Simply run: `gradio app.py`. 23 | - 2025/01/05: Training code for DeSTA2. See [this branch](https://github.com/kehanlu/Nemo/tree/icassp/examples/desta) 24 | - 2024/12/21: DeSTA2 is accepted to ICASSP 2025! 25 | - 2024/10/10: We release DeSTA2-8B-beta model. See [🤗 Model](https://huggingface.co/DeSTA-ntu/DeSTA2-8B-beta) for details. 26 | - 2024/10/09: We release the dataset(speech captions). See [🤗 Dataset](https://huggingface.co/datasets/DeSTA-ntu/DeSTA2-Llama3-8B-Instruct) for details. 27 | - 2024/10/01: DeSTA2 paper is now available on arXiv. [📑 Paper](https://arxiv.org/pdf/2409.20007) 28 | 29 | 30 | ## Usage 31 | 32 | ### Load Model 33 | 34 | **A. Using AutoModel from Huggingface transformers** 35 | 36 | ```python 37 | HF_TOKEN = "hf_..." # your huggingface token for downloading Llama3 from official Meta repo 38 | 39 | from transformers import AutoModel 40 | 41 | model = AutoModel.from_pretrained("DeSTA-ntu/DeSTA2-8B-beta", trust_remote_code=True, token=HF_TOKEN) 42 | model.to("cuda") 43 | ``` 44 | 45 | **B. Using DestaModel from this repo** 46 | 47 | ``` 48 | git clone git@github.com:kehanlu/DeSTA2.git 49 | cd DeSTA2 50 | pip install -e . 51 | ``` 52 | 53 | ```python 54 | HF_TOKEN = "hf_..." # your huggingface token for downloading Llama3 from official Meta repo 55 | 56 | from desta import DestaModel 57 | 58 | model = DestaModel.from_pretrained("DeSTA-ntu/DeSTA2-8B-beta", token=HF_TOKEN) 59 | model.to("cuda") 60 | ``` 61 | 62 | 63 | 64 | ### Have fun! 
65 | 66 | ```python 67 | messages = [ 68 | {"role": "system", "content": "You are a helpful voice assistant."}, 69 | {"role": "audio", "content": ""}, 70 | {"role": "user", "content": "Describe the audio."} 71 | ] 72 | 73 | generated_ids = model.chat( 74 | messages, 75 | max_new_tokens=128, 76 | do_sample=True, 77 | temperature=0.6, 78 | top_p=0.9 79 | ) 80 | 81 | response = model.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 82 | print(response) 83 | ``` 84 | 85 | 86 | ### Examples 87 | 88 | **Complex reasoning ability for arbitrary user input** 89 | ```python 90 | messages = [ 91 | {"role": "system", "content": "Focus on the input audio."}, 92 | {"role": "audio", "content": "assets/audios/DialogueEmotionClassification_DailyTalk_0196_7_1_d756.wav"}, 93 | {"role": "user", "content": "How many speakers are there in the audio? Can you tell me the emotion of each speaker? Can you also provide a brief summary of the audio? Is there any background noise? How do you feel today? Can you give me a response to the audio?"}, 94 | ] 95 | generated_ids = model.chat(messages, max_new_tokens=128, do_sample=True, temperature=0.6, top_p=0.9) 96 | response = model.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 97 | print(response) 98 | 99 | # == MODEL RESPONSE == 100 | # Based on the given audio snippet, here's my analysis: 101 | 102 | # 1. Number of speakers: 1 103 | # 2. Emotion of the speaker: Sadness 104 | # 3. Summary of the audio: The speaker is apologizing for causing trouble, indicating a sense of regret and remorse. 105 | # 4. Background noise: None mentioned, but it's possible that there might be some ambient noise that's not audible from the snippet. 106 | # 5. How I feel today: As an AI, I don't have emotions, but I'm designed to provide a neutral and objective analysis. 107 | # 6. Response to the audio: "It's okay, don't apologize." 108 | ``` 109 | 110 | **Math calculation and chain-of-thought reasoning based on the audio content** 111 | ```python 112 | messages = [ 113 | {"role": "system", "content": "Focus on the input audio."}, 114 | {"role": "audio", "content": "assets/audios/7_1_d7.wav"}, 115 | {"role": "user", "content": "How many speakers are there in the audio?\nCan you tell me the emotion of each speaker?(sad=1, angry=2, neutral=3, happy=4)?\nHow long is the audio in seconds? \n\nDo the math multiply the numbers from above 3 questions. Think step by step."}, 116 | ] 117 | 118 | generated_ids = model.chat(messages, max_new_tokens=128, do_sample=True, temperature=0.6, top_p=0.9) 119 | response = model.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 120 | print(response) 121 | 122 | # == MODEL RESPONSE == 123 | # Let's break it down step by step: 124 | # 1. How many speakers are there in the audio? 125 | # There is only 1 speaker. 126 | # 2. Can you tell me the emotion of each speaker? 127 | # The emotion is happiness, which corresponds to a value of 4. 128 | # 3. How long is the audio in seconds? 129 | # The audio is 2 seconds long. 130 | 131 | # Now, let's do the math: 132 | # 1 x 4 x 2 = 8 133 | # So, the answer is 8! 134 | ``` 135 | 136 | **Multilingual ability** 137 | > Note: the model only trained on English data, but it is generalizable to other languages thanks to the multilingual capability of Llama3! 
😊 138 | 139 | ```python 140 | messages = [ 141 | {"role": "system", "content": "請仔細聽輸入的音訊。"}, 142 | {"role": "audio", "content": "assets/audios/7_1_d7.wav"}, 143 | {"role": "user", "content": "請描述這段聲音,聲音聽起來如何?內容說了什麼?\n以中文回答。"}, 144 | ] 145 | 146 | generated_ids = model.chat(messages, max_new_tokens=128, do_sample=True, temperature=0.6, top_p=0.9) 147 | response = model.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 148 | print(response) 149 | 150 | # == MODEL RESPONSE == 151 | # 這段聲音是一個女性的聲音,聽起來是很愉快的。她說了一句「Thank you for calling. Goodbye」,語氣溫柔、柔和,沒有任何激動或激情的感覺。整個聲音聽起來很平靜、很自然,沒有任何明顯的感覺或情感色彩。 152 | ``` 153 | 154 | **Multi-turn question-answering** 155 | ```python 156 | messages = [ 157 | {"role": "system", "content": "Focus on the audio clip."}, 158 | {"role": "audio", "content": "assets/audios/7_1_d7.wav"}, 159 | {"role": "user", "content": "Transcribe the audio."}, 160 | ] 161 | 162 | generated_ids = model.chat(messages, max_new_tokens=256, do_sample=True, temperature=0.6, top_p=0.9) 163 | response = model.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 164 | print(response) 165 | 166 | # == MODEL RESPONSE == 167 | # Here is the transcription of the audio clip: 168 | # "Thank you for calling. Goodbye." 169 | 170 | messages.append({"role": "assistant", "content": response}) 171 | messages.append({"role": "user", "content": "Who is the speaker talking to?"}) 172 | generated_ids = model.chat(messages, max_new_tokens=256, do_sample=True, temperature=0.6, top_p=0.9) 173 | response = model.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] 174 | print(response) 175 | 176 | # == MODEL RESPONSE == 177 | # The speaker is speaking to someone who has called them, but it's not specified who that person is. The conversation appears to be a brief and abrupt goodbye, suggesting that the speaker may not have had a conversation with the caller or may have been interrupted. 
178 | ``` 179 | 180 | ## Citation 181 | 182 | if you find our work useful, please consider citing the paper: 183 | 184 | ``` 185 | @article{lu2024developing, 186 | title={Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data}, 187 | author={Lu, Ke-Han and Chen, Zhehuai and Fu, Szu-Wei and Yang, Chao-Han Huck and Balam, Jagadeesh and Ginsburg, Boris and Wang, Yu-Chiang Frank and Lee, Hung-yi}, 188 | journal={arXiv preprint arXiv:2409.20007}, 189 | year={2024} 190 | } 191 | 192 | @inproceedings{lu24c_interspeech, 193 | title = {DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment}, 194 | author = {Ke-Han Lu and Zhehuai Chen and Szu-Wei Fu and He Huang and Boris Ginsburg and Yu-Chiang Frank Wang and Hung-yi Lee}, 195 | year = {2024}, 196 | booktitle = {Interspeech 2024}, 197 | pages = {4159--4163}, 198 | doi = {10.21437/Interspeech.2024-457}, 199 | issn = {2958-1796}, 200 | } 201 | ``` 202 | -------------------------------------------------------------------------------- /desta/modeling_desta.py: -------------------------------------------------------------------------------- 1 | from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, WhisperForConditionalGeneration, PretrainedConfig, PreTrainedModel, BertConfig, AutoProcessor 2 | from transformers.models.bert.modeling_bert import BertEncoder 3 | from torch import nn 4 | import torch 5 | import os 6 | import librosa 7 | 8 | class Desta2Config(PretrainedConfig): 9 | model_type = "DestaModel" 10 | 11 | def __init__( 12 | self, 13 | llama_model_id="meta-llama/Meta-Llama-3-8B-Instruct", 14 | whisper_model_id="openai/whisper-small", 15 | prompt_size=64, 16 | **kwargs 17 | ): 18 | super().__init__(**kwargs) 19 | self.llama_model_id = llama_model_id 20 | self.whisper_model_id = whisper_model_id 21 | self.prompt_size = prompt_size 22 | 23 | self.whisper_config = AutoConfig.from_pretrained(self.whisper_model_id) 24 | self.llama_config = AutoConfig.from_pretrained(self.llama_model_id) 25 | 26 | class QformerConnector(PreTrainedModel): 27 | def __init__(self, cfg): 28 | super().__init__(cfg) 29 | self.cfg = cfg 30 | 31 | 32 | if self.cfg.whisper_model_id == "openai/whisper-medium": 33 | self.target_layer_ids = [5, 11, 17, 23] 34 | elif self.cfg.whisper_model_id == "openai/whisper-small": 35 | self.target_layer_ids = [2, 5, 8, 11] 36 | elif self.cfg.whisper_model_id == "openai/whisper-tiny": 37 | self.target_layer_ids = [0,1,2,3] 38 | elif self.cfg.whisper_model_id == "openai/whisper-large-v3": 39 | self.target_layer_ids = [3, 7, 11, 15, 19, 23, 27, 31] 40 | else: 41 | raise NotImplementedError(f"model_id {self.cfg.whisper_model_id} not implemented") 42 | 43 | 44 | self.layer_prompts = nn.ParameterList([ 45 | nn.Parameter(torch.randn(1, self.cfg.prompt_size, self.cfg.whisper_config.d_model)) for _ in range(len(self.target_layer_ids))] 46 | ) 47 | 48 | 49 | # (prompt_size, target_layers) 50 | self.layer_weights = nn.Parameter(torch.zeros(self.cfg.prompt_size, len(self.target_layer_ids), dtype=torch.float)) 51 | 52 | qformer_config = BertConfig() 53 | qformer_config.num_hidden_layers = 2 54 | qformer_config.num_attention_heads = self.cfg.whisper_config.encoder_attention_heads 55 | qformer_config.hidden_size = self.cfg.whisper_config.d_model 56 | qformer_config.add_cross_attention = True 57 | qformer_config.is_decoder = True 58 | 59 | self.qformer = BertEncoder(qformer_config) 60 | self.proj = nn.Sequential( 61 | nn.LayerNorm(self.cfg.whisper_config.d_model), 62 | 
nn.Linear(self.cfg.whisper_config.d_model, self.cfg.llama_config.hidden_size) # project to llama hidden size 63 | ) 64 | 65 | def forward(self, encoder_hidden_states): 66 | layer_prompt_outputs = [] 67 | for idx, encoder_hidden_state in enumerate(encoder_hidden_states): 68 | if idx in self.target_layer_ids: 69 | layer_prompt = self.layer_prompts[self.target_layer_ids.index(idx)].expand(encoder_hidden_state.size(0), -1, -1) 70 | qformer_output = self.qformer( 71 | hidden_states=layer_prompt, 72 | encoder_hidden_states=encoder_hidden_state, 73 | ) 74 | layer_prompt_output = qformer_output.last_hidden_state 75 | layer_prompt_outputs.append(layer_prompt_output) 76 | 77 | layer_prompt_outputs = torch.stack(layer_prompt_outputs, dim=0) 78 | layer_prompt_outputs = layer_prompt_outputs.permute(1, 2, 0, 3) 79 | 80 | self.norm_weights = torch.nn.functional.softmax(self.layer_weights, dim=-1).unsqueeze(-1) 81 | 82 | output = (layer_prompt_outputs * self.norm_weights).sum(dim=2) # (b, prompt_size, d_model) 83 | 84 | output = self.proj(output) 85 | 86 | return output 87 | 88 | class SpeechPerception(PreTrainedModel): 89 | def __init__(self, cfg): 90 | super().__init__(cfg) 91 | self.cfg = cfg 92 | 93 | self.whisper = WhisperForConditionalGeneration.from_pretrained(cfg.whisper_model_id) 94 | self.processor = AutoProcessor.from_pretrained(cfg.whisper_model_id) 95 | 96 | self.connector = QformerConnector(cfg) 97 | 98 | def generate(self, input_features): 99 | input_features = input_features.to(self.whisper.device) 100 | 101 | outputs = self.whisper.generate(input_features=input_features, return_dict_in_generate=True, output_hidden_states=True) # here we use default generate config for whisper 102 | 103 | transcriptions = self.processor.batch_decode(outputs.sequences, skip_special_tokens=True)[0] 104 | speech_features = self.connector(outputs.encoder_hidden_states) 105 | 106 | return transcriptions, speech_features 107 | 108 | 109 | class DestaModel(PreTrainedModel): 110 | config_class = Desta2Config 111 | 112 | def __init__(self, config, **kwargs): 113 | super().__init__(config) 114 | 115 | self.speech_perception = SpeechPerception(config) 116 | self.llama = AutoModelForCausalLM.from_pretrained(config.llama_model_id, torch_dtype=torch.bfloat16, **kwargs) 117 | self.tokenizer = AutoTokenizer.from_pretrained(config.llama_model_id, **kwargs) 118 | 119 | 120 | def chat(self, messages, max_new_tokens=128, do_sample=True, temperature=0.6, top_p=0.9): 121 | """ 122 | messages: list of dicts with keys "role" and "content" 123 | ``` 124 | [ 125 | {"role": "system", "content": "You are a helpful voice assistant."}, 126 | {"role": "audio", "content": ""}, 127 | {"role": "user", "content": "Describe the audio."} 128 | ] 129 | ``` 130 | """ 131 | 132 | audio_path, input_features = self.load_audio(messages) 133 | transcription, audio_features = self.speech_perception.generate(input_features) 134 | inputs, audio_position = self.process_text(messages, audio_path, transcription) 135 | 136 | inputs_embeds, attention_mask = self.prepare_llm_input( 137 | input_ids=inputs.input_ids, 138 | attention_mask=inputs.attention_mask, 139 | audio_position=audio_position, 140 | audio_features=audio_features 141 | ) 142 | 143 | outputs = self.llama.generate( 144 | inputs_embeds=inputs_embeds, 145 | attention_mask=attention_mask, 146 | pad_token_id=self.tokenizer.eos_token_id, 147 | max_new_tokens=max_new_tokens, 148 | do_sample=do_sample, 149 | temperature=temperature, 150 | top_p=top_p, 151 | ) 152 | return outputs 153 | 154 | def 
process_text(self, messages, audio_path, transcription): 155 | context = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) 156 | left_text, right_text = context.split(audio_path) 157 | right_text = transcription + right_text # 158 | 159 | audio_position = len(self.tokenizer.tokenize(left_text)) 160 | context = left_text + right_text 161 | 162 | inputs = self.tokenizer(context, return_tensors="pt") 163 | 164 | return inputs, audio_position 165 | 166 | 167 | def prepare_llm_input(self, input_ids, attention_mask, audio_position, audio_features): 168 | input_ids = input_ids.to(self.llama.device) 169 | attention_mask = attention_mask.to(self.llama.device) 170 | audio_features = audio_features.to(self.llama.device) 171 | audio_feature_length = audio_features.size(1) 172 | 173 | inputs_embeds = self.llama.model.embed_tokens(input_ids) # [bs, seq_len, hidden_size] 174 | 175 | 176 | inputs_embeds = torch.cat([inputs_embeds[0, :audio_position], audio_features[0, :], inputs_embeds[0, audio_position:]], dim=0) 177 | attention_mask = torch.cat([attention_mask[0, :audio_position], torch.ones([ audio_feature_length], dtype=torch.long, device=self.llama.device), attention_mask[0, audio_position:]], dim=0) 178 | 179 | inputs_embeds = inputs_embeds.to(self.llama.dtype) 180 | attention_mask = attention_mask.to(self.llama.dtype) 181 | return inputs_embeds.unsqueeze(0), attention_mask.unsqueeze(0) 182 | 183 | 184 | def load_audio(self, messages): 185 | audio_path = None 186 | for message in messages: 187 | if message["role"] == "audio" and audio_path is not None: 188 | raise ValueError("Multiple audio file paths found in messages. We only support one audio file per message at this moment.") 189 | if message["role"] == "audio": 190 | audio_path = message["content"] 191 | if audio_path is None: 192 | raise ValueError("No audio file path found in messages") 193 | audio, ori_sr = librosa.load(audio_path) 194 | audio = librosa.resample(audio, orig_sr=ori_sr, target_sr=16000) 195 | input_features = self.speech_perception.processor(audio, sampling_rate=16000, return_tensors="pt").input_features 196 | 197 | return audio_path, input_features 198 | 199 | @classmethod 200 | def from_pretrained(cls, pretrained_model_name_or_path, *model_args, config=None,**kwargs): 201 | config = cls.config_class.from_pretrained(pretrained_model_name_or_path, **kwargs) 202 | model = cls(config, **kwargs) 203 | 204 | if os.path.isdir(pretrained_model_name_or_path): 205 | model.speech_perception.connector.load_state_dict( 206 | torch.load(os.path.join(pretrained_model_name_or_path, "qformer_connector.pth")) 207 | ) 208 | else: 209 | from huggingface_hub import hf_hub_download 210 | path = hf_hub_download(repo_id=pretrained_model_name_or_path, filename="qformer_connector.pth") 211 | model.speech_perception.connector.load_state_dict( 212 | torch.load(path) 213 | ) 214 | 215 | return model 216 | -------------------------------------------------------------------------------- /index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data 9 | 10 | 11 |
12 |
13 |
14 |

15 | DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data 16 |

17 |
18 |

19 | Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank 20 | Wang, Hung-yi Lee 21 |

22 |

National Taiwan University, NVIDIA

23 | 24 | 📑 Paper | 25 | 🌐 Website | 26 | 👩‍💻 Github | 27 | 🤗 Model | 28 | 🤗 Dataset 29 | 30 | 44 |
45 | 46 |
47 |
48 |
49 | 50 |
51 |
52 |
53 |
54 |

✨✨ Key features

55 | 56 |
    57 |
  • 58 | We propose an automatic approach to develop speech language models without using speech 59 | instruction-tuning data. We aim to inject speech understanding capability into text-based LLMs while 60 | keeping their original versatility. 61 |
  • 62 |
  • 63 | We train the end-to-end speech language model with carefully constructed 64 | speech captions to avoid the catastrophic forgetting problem. 65 |
  • 66 |
  • 67 | The model shows strong instruction-following ability and 68 | demonstrates robust speech understanding capability on the 69 | Dynamic-SUPERB and 70 | AIR-Bench-Chat benchmarks. 71 |
  • 72 |
73 |
74 |
75 |
76 |
77 | 78 |
79 |
80 |
81 |
82 |

Instruction-following Ability

83 |

84 | Our model not only demonstrates strong speech understanding capabilities but also excels at following 85 | complex instructions, such as adhering to specific output formats and performing chain-of-thought reasoning. 86 |
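For reference, a minimal sketch of how such a format-constrained instruction can be issued through the released Python API (this assumes the `desta` package from this repository is installed, a GPU is available, and the bundled sample audio path exists; it mirrors the README usage rather than adding new functionality):

```python
from desta import DestaModel

# Load the released checkpoint (an HF token with Llama3 access is required).
model = DestaModel.from_pretrained("DeSTA-ntu/DeSTA2-8B-beta", token="hf_...")
model.to("cuda")

messages = [
    {"role": "system", "content": "Focus on the input audio."},
    {"role": "audio", "content": "assets/audios/7_1_d7.wav"},
    # The output-format constraint is simply part of the user turn.
    {"role": "user", "content": "What is the emotion of the speaker? Answer in JSON format."},
]

generated_ids = model.chat(messages, max_new_tokens=64, do_sample=True, temperature=0.6, top_p=0.9)
print(model.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```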

87 |
88 | 89 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 105 | 106 | 107 | 108 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 128 | 129 | 130 | 131 | 139 | 140 | 141 | 142 | 149 | 150 | 151 | 152 | 157 | 170 | 171 | 172 | 178 | 182 | 183 | 184 | 185 | 192 | 202 | 203 | 204 |
Instruction | Model Output
102 | What is the gender of the speaker? 103 | Your entire response should be in English, capital letters only. 104 | THE GENDER OF THE SPEAKER IS FEMALE.
109 | What is the emotion of the speaker? 110 | Your entire response should be in English, capital letters only. 111 | HAPPINESS
What is the gender of the speaker? Answer in JSON format. | {"gender": "Female" }
What is the emotion conveyed in the speech? Think step by step. 121 | Let's break it down step by step!
1. We start with the phrase "Thank you for calling". This 122 | implies that a request has been made, and the speaker is expressing gratitude.
2. The next phrase 123 | is "Goodbye". This is a normal way to end a conversation, and it's a polite way to part ways.
Considering 124 | these two phrases together, we can infer that the speaker is happy to end the conversation and is not 125 | expecting anything further. The tone is friendly and grateful, suggesting that the speaker has had a 126 | positive interaction.
Therefore, the emotion conveyed in the speech is happiness! 127 |
Analyze the speech, then tell me what is the emotion conveyed in the speech? 132 | Let's break down the speech:
* "Thank you for calling": This phrase is a polite expression of 133 | gratitude for the person's call.
* "Goodbye": This is a final farewell, indicating that the 134 | conversation is ending.
The tone of the speech is polite and friendly, as the speaker is thanking 135 | the person for calling and bidding them farewell.
The emotion conveyed in this speech is 136 | happiness. The speaker seems cheerful and appreciative, expressing gratitude for the call and ending 137 | the conversation on a positive note. 138 |
Calculate the words per second rate, think step by step. 143 | Let's break it down step by step:
1. The speech lasts for 2 seconds: [00:00:00 - 00:00:02]
2. 144 | The number of words is: "Thank you for calling. Goodbye." (5 words)
3. To find the words per 145 | second rate, we need to divide the number of words by the time it took to say the words:
Words 146 | per second = Total number of words / Time taken (in seconds)
= 5 words / 2 seconds
= 2.5 147 | words per second
Therefore, the words per second rate is 2.5 words 148 |
153 | How many speakers are there in the audio? Can you tell me the emotion of each speaker? Can you also 154 | provide a brief summary of the audio? Is there any background noise? How do you feel today? Can you 155 | give me a response to the audio? 156 | 158 | There is only 1 speaker in the audio.
159 | 160 | The speaker is a female, and her emotion is happiness.
161 | 162 | The audio is a brief goodbye message, where the speaker says "Thank you for calling. Goodbye."
163 | 164 | There is no background noise in the audio.
165 | 166 | As a machine, I don't have emotions, so I don't feel anything.
167 | 168 | As for a response to the audio, I can say: "You're welcome, and have a great day!" 169 |
173 | 請描述這段聲音,聲音聽起來如何?內容說了什麼? 以中文回答。

Note: our model only trained on English data, but it is generalizable to other languages thanks to 175 | the multilingual capability of Llama3! 😊 176 | 177 |
179 | 這段聲音是一個女性的聲音,聽起來是很愉快的。她說了一句「Thank you for calling. 180 | Goodbye」,語氣溫柔、柔和,沒有任何激動或激情的感覺。整個聲音聽起來很平靜、很自然,沒有任何明顯的感覺或情感色彩。 181 |
186 | Multi-turn question-answering
187 | First question
188 | Transcribe the audio.

189 | Second question
190 | Who is the speaker talking to? 191 |
193 | First response
194 | Here is the transcription of the audio clip:
195 | "Thank you for calling. Goodbye."
196 |
197 | Second response
198 | The speaker is speaking to someone who has called them, but it's not specified who that person is. The 199 | conversation appears to be a brief and abrupt goodbye, suggesting that the speaker may not have had a 200 | conversation with the caller or may have been interrupted. 201 |
205 |
206 |
207 |
208 | 209 |
210 |
211 |
212 |

Speech Understanding Ability

213 |
214 |

Dynamic-SUPERB

215 |

216 | Dynamic-SUPERB is a crowdsourced benchmark 217 | comprising 48 speech-related classification tasks with human-created instructions and answers. 218 |

219 |
220 | 221 |
222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | 239 | 243 | 248 | 249 | 250 | 251 | 260 | 264 | 269 | 270 | 271 | 272 | 281 | 285 | 288 | 289 | 290 | 291 | 300 | 304 | 309 | 310 | 311 | 312 | 321 | 325 | 329 | 330 | 331 | 332 | 338 | 342 | 347 | 348 | 349 | 350 | 358 | 362 | 369 | 370 | 371 | 372 |
Audio | Instruction | Model Output | Ground Truth
234 | 237 |

LanguageIdentification

238 |
240 | Which language can you recognize from this audio recording? The answer could be German, English, 241 | Spanish, Italian, Russian, or French. 242 | 244 | Based on the audio recording, I recognize the language as German. The phrase "Die hätte man dicht 245 | machen müssen" is a grammatically correct and idiomatic German sentence, and the speaker's accent 246 | and pronunciation also suggest a German speaker. 247 | German
252 | 258 |

NoiseDetection

259 |
261 | Determine whether the sentence is a noise-free utterance or a noisy one with music disturbance. The 262 | answer could be clean or noisy. 263 | 265 | Based on the provided information, I would classify the sentence as "noisy". The noise level is 266 | moderate (15db), and there is a reverberation time of 60ms, which suggests that there is some 267 | background noise or music disturbance present. 268 | Noisy
273 | 279 |

AccentClassification

280 |
282 | The task involves detecting variations in pronunciation and intonation. The answer could be 283 | american, australian, bangla, british, indian, malayalam, odiya, telugu, or welsh. 284 | 286 | Based on the accent and pronunciation, I would identify the speaker as having a British accent. 287 | British
292 | 298 |

EmotionRecognition

299 |
301 | Recognize and categorize the emotions expressed in spoken language. The answer could be anger, 302 | disgust, sadness, joy, neutral, surprise, or fear. 303 | 305 | Based on the audio clip, the emotion expressed is "joy". The speaker's tone and words ("Here she is, 306 | future fashion superstar") convey a sense of excitement and happiness, indicating a positive 307 | emotional state. 308 | Joy
313 | 319 |

DialogueEmotionClassification

320 |
322 | Analyze the dialogue and identify its corresponding emotion. The answer could be anger, disgust, 323 | fear, sadness, happiness, surprise, or no emotion. 324 | 326 | Based on the dialogue, the corresponding emotion is sadness. The speaker is apologizing and 327 | expressing regret for causing trouble, which is a characteristic of a sad tone. 328 | Sadness
333 | 336 |

SpeechTextMatching

337 |
339 | Verify if the speech and text are congruent. The answer could be yes or no.
Text: "guidelines 340 | for an experimental program to develop more detailed criteria" 341 |
343 | The answer is NO.
The speech is about saying a few words on the principles of design and 344 | typography, but the text is about guidelines for an experimental program to develop more detailed 345 | criteria, which is a different topic. 346 |
No
351 | 356 |

SpeakerVerification

357 |
359 | Scrutinize the first audio clip to establish whether the second one is delivered by the same 360 | speaker. The answer could be yes or no. 361 | 363 | Based on the provided audio clip, I would say NO, the second audio clip is not delivered by the same 364 | speaker. The tone, pitch, and emotion expressed in the first audio clip (neutral, calm, and 365 | introspective) are different from what I would expect from the same speaker. The first audio clip 366 | has a more contemplative and reflective tone, whereas the second audio clip is more dramatic and 367 | emotive. 368 | No
373 |
374 | 375 |
376 |

AIR-Bench-Chat

377 |

378 | AIR-Bench-Chat is an LLM-generated benchmark derived from 379 | speech metadata that focuses on open-ended questions. 380 |

381 |
382 |
383 | 384 | 385 | 386 | 387 | 388 | 389 | 390 | 391 | 392 | 393 | 394 | 399 | 402 | 408 | 412 | 413 | 414 | 419 | 423 | 428 | 429 | 430 | 431 | 436 | 437 | 446 | 447 | 448 | 449 | 454 | 455 | 456 | 457 | 458 | 459 |
Audio | Instruction | Model Output | Ground Truth
395 | 398 | 400 | Based on the speaker's emotional state, what could be the nature of the proof being requested? 401 | 403 | Based on the speaker's angry tone and high-pitched voice, it is likely that the proof being 404 | requested is related to a dispute or controversy. The speaker is demanding evidence to support a 405 | claim or argument, and is likely feeling frustrated or upset that the other party has not provided 406 | sufficient evidence to back up their claims. 407 | 409 | Given the speaker's anger, the proof likely relates to a serious allegation or an issue that 410 | significantly impacts the speaker.. 411 |
415 | 418 | 420 | What emotion is the speaker likely trying to convey to the listener by stating a fear of their 421 | departure? 422 | 424 | Based on the given information, the speaker is likely trying to convey a sense of sadness to the 425 | listener by stating a fear of their departure. The low pitch, slow speaking speed, and very low 426 | volume suggest a somber and melancholic tone, which is consistent with a sad emotion. 427 | The speaker is trying to convey a sense of sadness and attachment to the listener.
432 | 435 | What could be the next step for the manager after witnessing such a scene? 438 | Based on the manager's tone and emotion, it seems that she is amused and entertained by the scene. 439 | Given her high pitch and volume, she might be trying to emphasize the absurdity of the situation. 440 | Considering her slow speaking speed, she might be savoring the moment and enjoying the spectacle.

As 441 | a next step, the manager could:
* Laugh and join in on the fun, possibly even taking a photo or 442 | video to capture the moment.
* Try to intervene and calm the situation down, but in a playful 443 | and lighthearted manner, acknowledging the absurdity of the situation.
* Use the opportunity to 444 | make a humorous 445 |
The manager might need to intervene or address the situation occurring on the floor.
450 | 453 | What did the second person inquire about after the first person's request? | The second person inquired about the price range of the restaurant. | The second person inquired about any specifications on the price range.
460 |
461 |
462 |
463 |
464 | 465 |
466 |
467 |
468 |

Method

469 |
470 |

Stage 1: Dataset Construction

471 | 472 |

473 | In our dataset construction process, we first extract and format all textual attributes from the speech 474 | into a structured "seed transcript." This seed transcript is then combined with the prompt "What can you 475 | hear from the audio?" and fed into Llama3. The response generated by Llama3 becomes our training target 476 | for the end-to-end speech language model. 477 |

478 |

479 | There are 12 attributes: gender, age, accent, emotion, pitch, volume, speaking speed, SNR level, 480 | C50 value, duration, intent, and spoken text. 481 |
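To make this concrete, here is an illustrative sketch of how a seed transcript and the Llama3 prompt could be assembled from such metadata (the `build_seed_transcript` helper and the `segment` dictionary are hypothetical stand-ins, not the authors' actual data pipeline):

```python
def build_seed_transcript(segment):
    # Format the available metadata for one utterance into the structured
    # "seed transcript" text, e.g.
    # [00:00:01 - 00:00:02]: "These days are long gone." (Gender: Male, Accent: Scottish, ...)
    attrs = ", ".join(f"{k}: {v}" for k, v in segment["attributes"].items())
    return f'[{segment["start"]} - {segment["end"]}]: "{segment["text"]}" ({attrs})'

segment = {
    "start": "00:00:01",
    "end": "00:00:02",
    "text": "These days are long gone.",
    "attributes": {"Gender": "Male", "Accent": "Scottish", "Age": 20,
                   "Emotion": "neutral", "Duration": "2s"},
}

prompt = build_seed_transcript(segment) + "\nWhat can you hear from the audio?"
# `prompt` is then sent to a text-only LLM (Llama3-8B-Instruct); its response
# becomes the speech caption used as the training target.
```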

482 | 483 | 484 |
485 | 486 | 487 | 488 | 489 | 490 | 491 | 492 | 493 | 494 | 495 | 499 | 504 | 510 | 511 | 512 | 516 | 521 | 527 | 528 | 529 | 533 | 537 | 543 | 544 | 545 | 551 | 559 | 565 | 566 | 567 |
Seed Transcript | Llama3-Instruct Response | Corresponding Audio
496 | [00:00:01 - 00:00:02]: "These days are long gone." (Gender:Male, Accent: Scottish, Age: 20, 497 | Emotion: neutral, Duration: 2s) 498 | 500 | From the audio, I can hear:
* A male voice with a Scottish accent
* The voice is 501 | approximately 20 years old
* The tone is neutral, indicating a lack of strong emotions
* 502 | The audio duration is 2 seconds 503 |
505 | 508 |

VCTK-Corpus

509 |
513 | [00:00:01 - 00:00:04]: "At the top level, the salaries are far too high." (Gender:Male, Accent: 514 | English, Age: 23, Emotion: neutral, Duration: 4s) 515 | 517 | From the audio, we can hear a male speaker with an English accent (possibly British English) 518 | saying:
"At the top level, the salaries are far too high."
The speaker is a 23-year-old 519 | male with a neutral emotion. The audio lasts for 4 seconds. 520 |
522 | 525 |

VCTK-Corpus

526 |
530 | [00:00:01 - 00:00:04]: "Oh, lovely. Where's the wedding going to be?" (Gender:Male, Emotion: 531 | happiness, Action: question, Duration: 4s) 532 | 534 | From the audio, I can hear a male speaker asking a question with a tone of happiness, specifically 535 | the words: "Oh, lovely. Where's the wedding going to be?" 536 | 538 | 541 |

Dailytalk

542 |
546 | [00:00:01 - 00:00:07]: "A new pride taught me mine ego, and that teach I unto men: no longer to 547 | thrust one's head into the sand of celestial things, but to carry it freely, a terrestrial head, 548 | which giveth meaning to the earth!" (Gender:Male, Emotion: cheerful, Pitch: Normal, Volume: 549 | Normal, Speaking speed: Fast, Duration: 7s) 550 | 552 | From the given audio, we can hear a male speaker with a cheerful tone, speaking at a normal pitch 553 | and volume. The speaking speed is relatively fast, and the duration is 7 seconds.
The audio 554 | appears to be a philosophical or inspirational quote, with the speaker using poetic language to 555 | convey a message. The text itself is a condensed and somewhat abstract statement, but it seems to 556 | be encouraging people to be more grounded and down-to-earth, rather than getting lost in lofty or 557 | abstract concepts. 558 |
560 | 563 |

PromptTTS

564 |
568 |
569 |
570 |
571 |
572 |

Stage 2: Model Training

573 | 574 |

575 | During the model training phase, we utilize a pre-trained Whisper model to extract speech features and 576 | generate text transcriptions from the audio input. These speech features and the text embeddings (from 577 | the transcription and chat template) are concatenated before being fed into Llama3. We maintain a 578 | consistent template structure to prevent any mismatch. Finally, the model is trained to generate speech 579 | descriptions, effectively learning to interpret and articulate the nuances of spoken content based on 580 | both acoustic features and textual information. 581 |
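The released inference code follows the same recipe; the snippet below is a simplified paraphrase of `DestaModel.prepare_llm_input` in `desta/modeling_desta.py`, showing how the projected speech features are spliced into the text embeddings at the audio placeholder position before Llama3 generates (batch size 1, as in the released code):

```python
import torch

def splice_audio_features(text_embeds, attention_mask, audio_features, audio_position):
    # text_embeds:    (1, seq_len, hidden)     embeddings of the chat-template text
    # audio_features: (1, prompt_size, hidden) projected Q-Former outputs
    # audio_position: token index where the audio placeholder sits in the template
    inputs_embeds = torch.cat(
        [text_embeds[0, :audio_position], audio_features[0], text_embeds[0, audio_position:]],
        dim=0,
    )
    mask = torch.cat(
        [attention_mask[0, :audio_position],
         torch.ones(audio_features.size(1), dtype=attention_mask.dtype,
                    device=attention_mask.device),
         attention_mask[0, audio_position:]],
        dim=0,
    )
    # The result is fed to Llama3 via `inputs_embeds` instead of `input_ids`.
    return inputs_embeds.unsqueeze(0), mask.unsqueeze(0)
```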

582 | 583 | 584 |
585 |
586 |
587 |
588 |
589 | 590 | 591 | --------------------------------------------------------------------------------