# WavChat: A Survey of Spoken Dialogue Models
# 🚀Quick Start

1. [Introduction](#introduction)
2. [Overall](#overall)
   - [1. The organization of this survey](#1-the-organization-of-this-survey)
   - [2. General classification of spoken dialogue systems](#2-general-classification-of-spoken-dialogue-systems)
   - [3. Key capabilities of speech dialogue systems](#3-key-capabilities-of-speech-dialogue-systems)
   - [4. Publicly Available Speech Dialogue Models](#4-publicly-available-speech-dialogue-models)
3. [Representations of Spoken Dialogue Models](#representations-of-spoken-dialogue-models)
4. [Training Paradigm of Spoken Dialogue Model](#training-paradigm-of-spoken-dialogue-model)
5. [Streaming, Duplex, and Interaction](#streaming-duplex-and-interaction)
6. [Training Resources and Evaluation](#training-resources-and-evaluation)
   - [1. Training resources](#1-training-resources)
   - [2. Evaluation](#2-evaluation)
7. [Cite](#cite)

# 🔥What's new

- 2024.11.22: We release WavChat (a roughly 60-page survey of spoken dialogue models) on arXiv! 🎉
- 2024.08.31: We release [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) on arXiv.

## Introduction

This is the official repository of **WavChat: A Survey of Spoken Dialogue Models** [![Paper page](https://huggingface.co/datasets/huggingface/badges/raw/main/paper-page-sm-dark.svg)](https://arxiv.org/abs/2411.13577).
*Figure 1: The timeline of existing spoken dialogue models in recent years.*
> **Abstract**
>
> Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain. In the broader context of multimodal models, the speech modality offers a direct interface for human-computer interaction, enabling direct communication between AI and users. Compared to traditional three-tier cascaded spoken dialogue models that comprise speech recognition (ASR), large language models (LLMs), and text-to-speech (TTS), modern spoken dialogue models exhibit greater intelligence. These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech. Moreover, they generate high-quality, multi-turn speech responses with low latency, enabling real-time interaction through simultaneous listening and speaking capabilities. Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems and the underlying technologies. To address this, **we have first compiled existing spoken dialogue systems in chronological order and categorized them into the cascaded and end-to-end paradigms.** We then provide an in-depth overview of the core technologies in spoken dialogue models, covering aspects such as **speech representation, training paradigm, streaming, duplex, and interaction capabilities.** Each section discusses the limitations of these technologies and outlines considerations for future research. Additionally, we present a thorough review of **relevant datasets, evaluation metrics, and benchmarks** from the perspectives of training and evaluating spoken dialogue systems. We hope this survey will contribute to advancing both academic research and industrial applications in the field of spoken dialogue systems.

## Overall

#### 1. The organization of this survey
*Figure 2: Organization of this survey.*
#### 2. General classification of spoken dialogue systems
*Figure 3: A general overview of current spoken dialogue systems.*
#### 3. Key capabilities of speech dialogue systems
*Figure 4: An overview of the spoken dialogue systems' nine ideal capabilities.*
#### 4. Publicly Available Speech Dialogue Models
| Model | URL |
| --- | --- |
| AudioGPT | https://github.com/AIGC-Audio/AudioGPT |
| SpeechGPT | https://github.com/0nutation/SpeechGPT |
| Freeze-Omni | https://github.com/VITA-MLLM/Freeze-Omni |
| Baichuan-Omni | https://github.com/westlake-baichuan-mllm/bc-omni |
| GLM-4-Voice | https://github.com/THUDM/GLM-4-Voice |
| Mini-Omni | https://github.com/gpt-omni/mini-omni |
| Mini-Omni2 | https://github.com/gpt-omni/mini-omni2 |
| FunAudioLLM | https://github.com/FunAudioLLM |
| Qwen-Audio | https://github.com/QwenLM/Qwen-Audio |
| Qwen2-Audio | https://github.com/QwenLM/Qwen2-Audio |
| LLaMA3.1 | https://www.llama.com |
| Audio Flamingo | https://github.com/NVIDIA/audio-flamingo |
| Ultravox | https://github.com/fixie-ai/ultravox |
| Spirit LM | https://github.com/facebookresearch/spiritlm |
| dGSLM | https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/dgslm |
| Spoken-LLM | https://arxiv.org/abs/2305.11000 |
| LLaMA-Omni | https://github.com/ictnlp/LLaMA-Omni |
| Moshi | https://github.com/kyutai-labs/moshi |
| SALMONN | https://github.com/bytedance/SALMONN |
| LTU-AS | https://github.com/YuanGongND/ltu |
| VITA | https://github.com/VITA-MLLM/VITA |
| SpeechGPT-Gen | https://github.com/0nutation/SpeechGPT |
| WavLLM | https://github.com/microsoft/SpeechT5/tree/main/WavLLM |
| Westlake-Omni | https://github.com/xinchen-ai/Westlake-Omni |
| MooER-Omni | https://github.com/MooreThreads/MooER |
| Hertz-dev | https://github.com/Standard-Intelligence/hertz-dev |
| Fish-Agent | https://github.com/fishaudio/fish-speech |
| SpeechGPT2 | https://0nutation.github.io/SpeechGPT2.github.io/ |

*Table 1: The list of publicly available speech dialogue models and their URLs.*
## Representations of Spoken Dialogue Models

In this section, we provide insights into how speech should be represented in a spoken dialogue model for better understanding and generation. The choice of representation directly affects the model's effectiveness in processing speech signals, its overall performance, and its range of applications. The section covers two main types of representations: **semantic representations** and **acoustic representations**.

| | Advantages on the comprehension side | Ability to unify music and audio | Compression rate of speech | Conversion to historical context | Emotional and acoustic information | Post-processing pipeline |
| --- | --- | --- | --- | --- | --- | --- |
| **Semantic** | Strong | Weak | High | Easy | Less | Cascade |
| **Acoustic** | Weak | Strong | Low | Difficult | More | End-to-end |
*Table 2: The comparison of semantic and acoustic representations.*
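To make Table 2 concrete, here is a minimal sketch of extracting **semantic** tokens: encode speech with a self-supervised model (HuBERT via `transformers`) and discretize the frame features with k-means. The checkpoint, input file, and cluster count are illustrative assumptions, not choices prescribed by the survey.

```python
import torch
import torchaudio
from sklearn.cluster import KMeans
from transformers import HubertModel

model = HubertModel.from_pretrained("facebook/hubert-base-ls960")
model.eval()

wav, sr = torchaudio.load("utterance.wav")            # hypothetical input file
wav = wav.mean(0, keepdim=True)                       # force mono: (1, samples)
wav = torchaudio.functional.resample(wav, sr, 16000)  # HuBERT expects 16 kHz

with torch.no_grad():
    feats = model(wav).last_hidden_state[0]           # (frames, 768), ~50 Hz

# Discretize frames into semantic token IDs. Real systems fit k-means on a
# large corpus; fitting on a single utterance is only for illustration.
kmeans = KMeans(n_clusters=100, n_init="auto").fit(feats.numpy())
semantic_tokens = kmeans.labels_                      # one ID per ~20 ms frame
print(semantic_tokens[:20])
```

Tokens like these keep linguistic content but discard most speaker and acoustic detail, which is exactly the trade-off summarized in Table 2.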
We also provide a comprehensive list of publicly available codec models and their URLs.
| Model | URL |
| --- | --- |
| Encodec | https://github.com/facebookresearch/encodec |
| SoundStream | https://github.com/wesbz/SoundStream |
| DAC | https://github.com/descriptinc/descript-audio-codec |
| WavTokenizer | https://github.com/jishengpeng/WavTokenizer |
| SpeechTokenizer | https://github.com/ZhangXInFD/SpeechTokenizer |
| SNAC | https://github.com/hubertsiuzdak/snac |
| SemantiCodec | https://github.com/haoheliu/SemantiCodec-inference |
| Mimi | https://github.com/kyutai-labs/moshi |
| HiFi-Codec | https://github.com/yangdongchao/AcademiCodec |
| FunCodec | https://github.com/modelscope/FunCodec |
| APCodec | https://github.com/YangAi520/APCodec/tree/main |
| AudioDec | https://github.com/facebookresearch/AudioDec |
| FACodec | https://github.com/lifeiteng/naturalspeech3_facodec |
| Language-Codec | https://github.com/jishengpeng/Languagecodec |
| XCodec | https://github.com/zhenye234/xcodec |
| TiCodec | https://github.com/y-ren16/TiCodec |
| SoCodec | https://github.com/hhguo/SoCodec |
| FUVC | https://github.com/z21110008/FUVC |
| HILCodec | https://github.com/aask1357/hilcodec |
| LaDiffCodec | https://github.com/haiciyang/LaDiffCodec |
| LLM-Codec | https://github.com/yangdongchao/LLM-Codec |
| SpatialCodec | https://github.com/XZWY/SpatialCodec |
| BigCodec | https://github.com/Aria-K-Alethia/BigCodec |
| SuperCodec | https://github.com/exercise-book-yq/Supercodec |
| RepCodec | https://github.com/mct10/RepCodec |
| EnCodecMAE | https://github.com/habla-liaa/encodecmae |
| MuCodec | https://github.com/xuyaoxun/MuCodec |
| SPARC | https://github.com/Berkeley-Speech-Group/Speech-Articulatory-Coding |
| BANC | https://github.com/anton-jeran/MULTI-AUDIODEC |
| SpeechRVQ | https://huggingface.co/ibm/DAC.speech.v1.0 |
| QINCo | https://github.com/facebookresearch/Qinco |
| SimVQ | https://github.com/youngsheen/SimVQ |

*Table 3: A comprehensive list of publicly available codec models and their URLs.*
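On the **acoustic** side, here is a minimal round-trip sketch with EnCodec (the first entry in Table 3): encode a waveform into discrete codebook indices, then decode the indices back to audio. The bandwidth setting and input file are assumptions for illustration.

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)            # 6 kbps -> 8 residual codebooks

wav, sr = torchaudio.load("utterance.wav") # hypothetical input file
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))            # list of (codes, scale)
    codes = torch.cat([c for c, _ in frames], dim=-1)  # (1, n_q, T) token grid
    recon = model.decode(frames)                       # waveform rebuilt from tokens

print(codes.shape)  # n_q discrete tokens per ~13 ms frame at 24 kHz
```

Because the tokens are built to reconstruct the waveform, they preserve the emotional and acoustic detail that semantic tokens discard, at the cost of a lower compression rate, as Table 2 contrasts.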
## Training Paradigm of Spoken Dialogue Model

In this section, we focus on how to adapt text-based large language models (LLMs) into dialogue systems with speech processing capabilities. The **selection and design of the training paradigm** have a direct impact on the model's **performance, latency, and multimodal alignment**.
*Figure 5: Categorization of spoken dialogue model architectural paradigms (left) and multi-stage training steps (right).*
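As a rough sketch of the multi-stage steps on the right of Figure 5, the snippet below freezes a text LLM while a speech adapter is trained for modal alignment, then unfreezes the LLM for the later dialogue stages. The tiny modules are hypothetical stand-ins, not any surveyed model's architecture.

```python
import torch.nn as nn

# Hypothetical stand-ins for the two trainable components: a text LLM
# backbone and a lightweight adapter that maps speech features into it.
llm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
)
adapter = nn.Linear(80, 256)  # e.g. 80-dim fbank features -> LLM hidden size

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Freeze or unfreeze all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1 (modal alignment): train only the adapter on paired speech-text
# data (ASR/TTS-style tasks) while the LLM stays frozen.
set_trainable(llm, False)
set_trainable(adapter, True)

# Stage 2 (dual-stream processing / conversation fine-tuning): unfreeze the
# LLM as well and continue on instruction and dialogue data.
set_trainable(llm, True)

trainable = sum(p.numel() for m in (adapter, llm)
                for p in m.parameters() if p.requires_grad)
print(f"trainable parameters after stage 2: {trainable}")
```

The stage names mirror those used in Table 4 below; individual models vary in how many stages they use and which components they unfreeze at each one.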
We also provide a comprehensive overview of alignment post-training methods.

*Figure 6: Alignment post-training methods.*
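As one concrete instance of this family, below is a sketch of the DPO objective computed over a (preferred, rejected) response pair; the log-probabilities are toy placeholders, and Figure 6 covers a broader set of alignment techniques than this single example.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO: push the policy toward preferred responses relative to a frozen
    reference model, without training an explicit reward model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Toy sequence log-probabilities standing in for policy/reference outputs
# on one (preferred, rejected) response pair.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```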
## Streaming, Duplex, and Interaction

This section discusses the implementation of **streaming processing, duplex communication, and interaction capabilities** in speech dialogue models. These features are crucial for improving the model's response speed, naturalness, and interactivity in real-time conversations.
*Figure 7: An example diagram of duplex interaction.*
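A minimal `asyncio` sketch of the duplex behavior in Figure 7: the system keeps listening while it speaks, and a detected user barge-in interrupts playback mid-response. The queue contents and timings are hypothetical simplifications of real audio streams.

```python
import asyncio

async def listen(mic: asyncio.Queue, barge_in: asyncio.Event):
    """Keep consuming microphone chunks even while the system is speaking."""
    while True:
        chunk = await mic.get()
        if chunk == "USER_SPEAKS":   # stand-in for a voice-activity detector
            barge_in.set()

async def speak(chunks, barge_in: asyncio.Event):
    """Stream the response chunk by chunk, yielding the floor on barge-in."""
    for chunk in chunks:
        if barge_in.is_set():
            print("barge-in detected: stopping playback")
            return
        print("playing:", chunk)
        await asyncio.sleep(0.1)     # stands in for real audio playback

async def main():
    mic, barge_in = asyncio.Queue(), asyncio.Event()
    listener = asyncio.create_task(listen(mic, barge_in))
    speaker = asyncio.create_task(speak(["chunk-1", "chunk-2", "chunk-3"], barge_in))
    await asyncio.sleep(0.15)
    await mic.put("USER_SPEAKS")     # the user starts talking mid-response
    await speaker
    listener.cancel()

asyncio.run(main())
```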
## Training Resources and Evaluation

#### 1. Training resources
| Stage | Task | Dataset | Size | URL | Modality |
| --- | --- | --- | --- | --- | --- |
| Modal Alignment | Multilingual TTS | Emilia | 101k hrs | Link | Text, Speech |
| Modal Alignment | Mandarin ASR | AISHELL-1 | 170 hrs | Link | Text, Speech |
| Modal Alignment | Mandarin ASR | AISHELL-2 | 1k hrs | Link | Text, Speech |
| Modal Alignment | Mandarin TTS | AISHELL-3 | 85 hrs, 88,035 utt., 218 spk. | Link | Text, Speech |
| Modal Alignment | TTS | LibriTTS | 585 hrs | Link | Text, Speech |
| Modal Alignment | ASR | TED-LIUM | 452 hrs | Link | Text, Speech |
| Modal Alignment | ASR | VoxPopuli | 1.8k hrs | Link | Text, Speech |
| Modal Alignment | ASR | Librispeech | 1,000 hrs | Link | Text, Speech |
| Modal Alignment | ASR | MLS | 44.5k hrs | Link | Text, Speech |
| Modal Alignment | TTS | Wenetspeech | 22.4k hrs | Link | Text, Speech |
| Modal Alignment | ASR | Gigaspeech | 40k hrs | Link | Text, Speech |
| Modal Alignment | ASR | VCTK | 300 hrs | Link | Text, Speech |
| Modal Alignment | TTS | LJSpeech | 24 hrs | Link | Text, Speech |
| Modal Alignment | ASR | Common Voice | 2,500 hrs | Link | Text, Speech |
| Dual-Stream Processing | Instruction | Alpaca | 52,000 items | Link | Text + TTS |
| Dual-Stream Processing | Instruction | Moss | - | Link | Text + TTS |
| Dual-Stream Processing | Instruction | BelleCN | - | Link | Text + TTS |
| Dual-Stream Processing | Dialogue | UltraChat | 1.5 million | Link | Text + TTS |
| Dual-Stream Processing | Instruction | Open-Orca | - | Link | Text + TTS |
| Dual-Stream Processing | Noise | DNS | 2425 hrs | Link | Noise data |
| Dual-Stream Processing | Noise | MUSAN | - | Link | Noise data |
| Conversation Fine-Tune | Dialogue | Fisher | 964 hrs | Link | Text, Speech |
| Conversation Fine-Tune | Dialogue | GPT-Talker | - | Link | Text, Speech |
| Conversation Fine-Tune | Instruction | INSTRUCTS2S-200K | 200k items | Link | Text + TTS |
| Conversation Fine-Tune | Instruction | Open Hermes | 900k items | Link | Text + TTS |

*Table 4: Datasets used in the various training stages.*
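As a quick-start example, one corpus from Table 4 (Librispeech) ships with a built-in `torchaudio` wrapper; the sketch below downloads a split and reads a single utterance. Other datasets in the table are distributed through their own channels, and the root path here is an assumption.

```python
import torchaudio

# download=True fetches the "train-clean-100" split (~6 GB) on first use.
dataset = torchaudio.datasets.LIBRISPEECH("./data", url="train-clean-100", download=True)
waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[0]
print(sample_rate, transcript[:60])
```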
| Dataset | Size | URL | Modality |
| --- | --- | --- | --- |
| ESC-50 | 2,000 clips (5s each) | Link | Sound |
| UrbanSound8K | 8,732 clips (<=4s each) | Link | Sound |
| AudioSet | 2,000k+ clips (10s each) | Link | Sound |
| TUT Acoustic Scenes 2017 | 52,630 segments | Link | Sound |
| Warblr | 10,000 clips | Link | Sound |
| FSD50K | 51,197 clips (total 108.3 hours) | Link | Sound |
| DCASE Challenge | varies annually | Link | Sound |
| IRMAS | 6,705 audio files (3s each) | Link | Music |
| FMA | 106,574 tracks | Link | Music |
| NSynth | 305,979 notes | Link | Music |
| EMOMusic | 744 songs | Link | Music |
| MedleyDB | 122 multitrack recordings | Link | Music |
| MagnaTagATune | 25,863 clips (30s each) | Link | Music |
| MUSDB | 150 songs | Link | Music |
| M4Singer | 700 songs | Link | Music |
| Jamendo | 600k songs | Link | Music |

*Table 5: Music and non-speech sound datasets.*
#### 2. Evaluation

Evaluation is a crucial aspect of training and testing spoken dialogue models. In this section, we provide a comprehensive overview of evaluation from **11 aspects**. The evaluation metrics fall into **two main types**: **Basic Evaluation** and **Advanced Evaluation**.
*Table 6: This table evaluates model performance across various abilities, common tasks, representative benchmarks, and corresponding metrics.*
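As a small worked example of one basic metric that typically appears in such evaluations, the sketch below computes word error rate (WER) with the `jiwer` package on toy strings.

```python
import jiwer

reference = "turn on the living room lights"
hypothesis = "turn on the living room light"
# One substitution out of six reference words -> WER ≈ 0.167.
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")
```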
## Cite

```bibtex
@article{ji2024wavchat,
  title={WavChat: A Survey of Spoken Dialogue Models},
  author={Ji, Shengpeng and Chen, Yifu and Fang, Minghui and Zuo, Jialong and Lu, Jingyu and Wang, Hanting and Jiang, Ziyue and Zhou, Long and Liu, Shujie and Cheng, Xize and others},
  journal={arXiv preprint arXiv:2411.13577},
  year={2024}
}
```