# 🚀Quick Start

1. [Introduction](#introduction)
2. [Overall](#overall)
   - [1. The organization of this survey](#1-the-organization-of-this-survey)
   - [2. General classification of spoken dialogue systems](#2-general-classification-of-spoken-dialogue-systems)
   - [3. Key capabilities of speech dialogue systems](#3-key-capabilities-of-speech-dialogue-systems)
   - [4. Publicly Available Speech Dialogue Models](#4-publicly-available-speech-dialogue-models)
3. [Representations of Spoken Dialogue Models](#representations-of-spoken-dialogue-models)
4. [Training Paradigm of Spoken Dialogue Model](#training-paradigm-of-spoken-dialogue-model)
5. [Streaming, Duplex, and Interaction](#streaming-duplex-and-interaction)
6. [Training Resources and Evaluation](#training-resources-and-evaluation)
   - [1. Training resources](#1-training-resources)
   - [2. Evaluation](#2-evaluation)
7. [Cite](#cite)

# 🔥What's new

- 2024.11.22: We release WavChat (a survey of spoken dialogue models, about 60 pages) on arXiv! 🎉
- 2024.08.31: We release [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) on arXiv.

## Introduction

This is the official repository of **WavChat: A Survey of Spoken Dialogue Models** [![Paper page](https://huggingface.co/datasets/huggingface/badges/raw/main/paper-page-sm-dark.svg)](https://arxiv.org/abs/2411.13577).

Figure 1: The timeline of existing spoken dialogue models in recent years.

> Abstract
>
> Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain. In the broader context of multimodal models, the speech modality offers a direct interface for human-computer interaction, enabling direct communication between AI and users. Compared to traditional three-tier cascaded spoken dialogue models that comprise speech recognition (ASR), large language models (LLMs), and text-to-speech (TTS), modern spoken dialogue models exhibit greater intelligence. These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech. Moreover, they generate high-quality, multi-turn speech responses with low latency, enabling real-time interaction through simultaneous listening and speaking capability. Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems and the underlying technologies. To address this, **we have first compiled existing spoken dialogue systems in the chronological order and categorized them into the cascaded and end-to-end paradigms.** We then provide an in-depth overview of the core technologies in spoken dialogue models, covering aspects such as **speech representation, training paradigm, streaming, duplex, and interaction capabilities.** Each section discusses the limitations of these technologies and outlines considerations for future research. Additionally, we present a thorough review of **relevant datasets, evaluation metrics, and benchmarks** from the perspectives of training and evaluating spoken dialogue systems. We hope this survey will contribute to advancing both academic research and industrial applications in the field of spoken dialogue systems.

## Overall

#### 1. The organization of this survey

Figure 2: Organization of this survey.

#### 2. General classification of spoken dialogue systems

Figure 3: A general overview of current spoken dialogue systems.
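
To make the cascaded paradigm mentioned in the abstract more concrete, here is a minimal sketch of a three-stage ASR → LLM → TTS pipeline. The class `CascadedDialogueSystem` and the `toy_*` functions are hypothetical stand-ins invented for illustration; they are not part of any model listed in this repository, and a real system would plug in actual ASR, LLM, and TTS components.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadedDialogueSystem:
    """Three-stage cascade: speech -> text -> text -> speech."""

    asr: Callable[[bytes], str]   # speech recognition
    llm: Callable[[str], str]     # text-only language model
    tts: Callable[[str], bytes]   # speech synthesis

    def respond(self, audio_in: bytes) -> bytes:
        # Only the transcript is passed on, so cues such as emotion
        # or timbre in the input audio do not reach the LLM.
        transcript = self.asr(audio_in)
        reply_text = self.llm(transcript)
        return self.tts(reply_text)


# Hypothetical stand-ins so the sketch runs without any model weights.
def toy_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def toy_llm(text: str) -> str:
    return f"You said: {text}"


def toy_tts(text: str) -> bytes:
    return text.encode("utf-8")


system = CascadedDialogueSystem(asr=toy_asr, llm=toy_llm, tts=toy_tts)
reply = system.respond(b"hello")
print(reply)
```

An end-to-end model, by contrast, would replace the three callables with a single component that consumes and produces speech representations directly.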

#### 3. Key capabilities of speech dialogue systems

Figure 4: An overview of the spoken dialogue systems' nine ideal capabilities.

#### 4. Publicly Available Speech Dialogue Models

| Model | URL |
| ----- | --- |
| AudioGPT | https://github.com/AIGC-Audio/AudioGPT |
| SpeechGPT | https://github.com/0nutation/SpeechGPT |
| Freeze-Omni | https://github.com/VITA-MLLM/Freeze-Omni |
| Baichuan-Omni | https://github.com/westlake-baichuan-mllm/bc-omni |
| GLM-4-Voice | https://github.com/THUDM/GLM-4-Voice |
| Mini-Omni | https://github.com/gpt-omni/mini-omni |
| Mini-Omni2 | https://github.com/gpt-omni/mini-omni2 |
| FunAudioLLM | https://github.com/FunAudioLLM |
| Qwen-Audio | https://github.com/QwenLM/Qwen-Audio |
| Qwen2-Audio | https://github.com/QwenLM/Qwen2-Audio |
| LLaMA3.1 | https://www.llama.com |
| Audio Flamingo | https://github.com/NVIDIA/audio-flamingo |
| Ultravox | https://github.com/fixie-ai/ultravox |
| Spirit LM | https://github.com/facebookresearch/spiritlm |
| dGSLM | https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/dgslm |
| Spoken-LLM | https://arxiv.org/abs/2305.11000 |
| LLaMA-Omni | https://github.com/ictnlp/LLaMA-Omni |
| Moshi | https://github.com/kyutai-labs/moshi |
| SALMONN | https://github.com/bytedance/SALMONN |
| LTU-AS | https://github.com/YuanGongND/ltu |
| VITA | https://github.com/VITA-MLLM/VITA |
| SpeechGPT-Gen | https://github.com/0nutation/SpeechGPT |
| WavLLM | https://github.com/microsoft/SpeechT5/tree/main/WavLLM |
| Westlake-Omni | https://github.com/xinchen-ai/Westlake-Omni |
| MooER-Omni | https://github.com/MooreThreads/MooER |
| Hertz-dev | https://github.com/Standard-Intelligence/hertz-dev |
| Fish-Agent | https://github.com/fishaudio/fish-speech |
| SpeechGPT2 | https://0nutation.github.io/SpeechGPT2.github.io/ |

Table 1: The list of publicly available speech dialogue models and their URLs

## Representations of Spoken Dialogue Models

This section examines how speech data is represented in a spoken dialogue model so that speech can be better understood and generated. The choice of representation directly affects how effectively the model processes speech signals, as well as overall system performance and the range of applications. The section covers two main types of representations: **semantic representations** and **acoustic representations**.

|              | Advantage on the comprehension side | Unified modeling of music and audio | Compression rate of speech | Conversion to historical context | Emotional and acoustic information | Post-processing pipeline |
| ------------ | ----------------------------------- | ----------------------------------- | -------------------------- | -------------------------------- | ---------------------------------- | ------------------------ |
| **Semantic** | Strong                              | Weak                                | High                       | Easy                             | Less                               | Cascade                  |
| **Acoustic** | Weak                                | Strong                              | Low                        | Difficult                        | More                               | End-to-end               |

Table 2: The comparison of semantic and acoustic representations
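
One way to make the "compression rate of speech" column concrete for a discrete tokenizer is to compute the bitrate of its token stream, which is the frame rate times the number of codebooks times the bits per code. The sketch below assumes a residual-vector-quantization style codec; the function names and the example configuration (75 frames/s, 8 codebooks, 1024 entries each, 24 kHz 16-bit input) are illustrative assumptions and are not taken from the tables in this repository.

```python
import math


def codec_bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bits per second of a token stream: frames/s * codebooks * bits per code."""
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)


def tokens_per_second(frame_rate_hz: float, num_codebooks: int) -> float:
    """Number of discrete tokens a language model must handle per second of audio."""
    return frame_rate_hz * num_codebooks


def compression_ratio(sample_rate_hz: int, bits_per_sample: int, bitrate_bps: float) -> float:
    """How many times smaller the token stream is than uncompressed PCM."""
    return sample_rate_hz * bits_per_sample / bitrate_bps


# Illustrative configuration (an assumption, not a measurement of any listed codec).
bitrate = codec_bitrate_bps(frame_rate_hz=75, num_codebooks=8, codebook_size=1024)
ratio = compression_ratio(sample_rate_hz=24_000, bits_per_sample=16, bitrate_bps=bitrate)
print(f"{bitrate:.0f} bps, {tokens_per_second(75, 8):.0f} tokens/s, {ratio:.0f}x smaller than PCM")
```

Fewer tokens per second generally means shorter sequences for the language model, which is one reason the compression rate matters when choosing a representation.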

We also provide a comprehensive list of publicly available codec models and their URLs.

| Model | URL |
| ----- | --- |
| Encodec | https://github.com/facebookresearch/encodec |
| SoundStream | https://github.com/wesbz/SoundStream |
| DAC | https://github.com/descriptinc/descript-audio-codec |
| WavTokenizer | https://github.com/jishengpeng/WavTokenizer |
| SpeechTokenizer | https://github.com/ZhangXInFD/SpeechTokenizer |
| SNAC | https://github.com/hubertsiuzdak/snac |
| SemantiCodec | https://github.com/haoheliu/SemantiCodec-inference |
| Mimi | https://github.com/kyutai-labs/moshi |
| HiFi-Codec | https://github.com/yangdongchao/AcademiCodec |
| FunCodec | https://github.com/modelscope/FunCodec |
| APCodec | https://github.com/YangAi520/APCodec/tree/main |
| AudioDec | https://github.com/facebookresearch/AudioDec |
| FACodec | https://github.com/lifeiteng/naturalspeech3_facodec |
| Language-Codec | https://github.com/jishengpeng/Languagecodec |
| XCodec | https://github.com/zhenye234/xcodec |
| TiCodec | https://github.com/y-ren16/TiCodec |
| SoCodec | https://github.com/hhguo/SoCodec |
| FUVC | https://github.com/z21110008/FUVC |
| HILCodec | https://github.com/aask1357/hilcodec |
| LaDiffCodec | https://github.com/haiciyang/LaDiffCodec |
| LLM-Codec | https://github.com/yangdongchao/LLM-Codec |
| SpatialCodec | https://github.com/XZWY/SpatialCodec |
| BigCodec | https://github.com/Aria-K-Alethia/BigCodec |
| SuperCodec | https://github.com/exercise-book-yq/Supercodec |
| RepCodec | https://github.com/mct10/RepCodec |
| EnCodecMAE | https://github.com/habla-liaa/encodecmae |
| MuCodec | https://github.com/xuyaoxun/MuCodec |
| SPARC | https://github.com/Berkeley-Speech-Group/Speech-Articulatory-Coding |
| BANC | https://github.com/anton-jeran/MULTI-AUDIODEC |
| SpeechRVQ | https://huggingface.co/ibm/DAC.speech.v1.0 |
| QINCo | https://github.com/facebookresearch/Qinco |
| SimVQ | https://github.com/youngsheen/SimVQ |

Table 3: A comprehensive list of publicly available codec models and their URLs.

## **Training Paradigm of Spoken Dialogue Model**

This section focuses on how to adapt text-based large language models (LLMs) into dialogue systems with speech processing capabilities. The **selection and design of training paradigms** directly affect the model's **performance, real-time performance, and multimodal alignment**.

Figure 5: Categorization Diagram of Spoken Dialogue Model Architectural Paradigms (left) and Diagram of Multi-stage Training Steps (right)

We also summarize the alignment post-training methods.

Figure 6: Alignment Post-training Methods

## Streaming, Duplex, and Interaction

This section discusses how **streaming processing, duplex communication, and interaction capabilities** are implemented in speech dialogue models. These features are crucial for improving the model's response speed, naturalness, and interactivity in real-time conversations.

Figure 7: The Example Diagram of Duplex Interaction
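
As a rough illustration of duplex interaction, the toy state machine below keeps consuming the user's audio status on every frame, even while the system is speaking, and yields the floor when the user interrupts. This is a hand-written sketch under simplifying assumptions: the names `State`, `duplex_step`, and `run`, and the per-frame boolean inputs, are hypothetical, and the models covered by the survey may instead learn this behavior rather than follow explicit rules.

```python
from enum import Enum
from typing import Iterable, List, Tuple


class State(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"


def duplex_step(state: State, user_is_speaking: bool, has_reply_ready: bool) -> State:
    """Advance one audio frame; the user input is observed in every state."""
    if state is State.SPEAKING:
        # Barge-in (user interrupts) or nothing left to say: stop speaking.
        if user_is_speaking or not has_reply_ready:
            return State.LISTENING
        return State.SPEAKING
    # LISTENING: start speaking only once the user is silent and a reply exists.
    if not user_is_speaking and has_reply_ready:
        return State.SPEAKING
    return State.LISTENING


def run(frames: Iterable[Tuple[bool, bool]]) -> List[str]:
    """Run the state machine over (user_is_speaking, has_reply_ready) frames."""
    state = State.LISTENING
    trace = []
    for user_is_speaking, has_reply_ready in frames:
        state = duplex_step(state, user_is_speaking, has_reply_ready)
        trace.append(state.value)
    return trace


# User talks, system answers, user interrupts, system resumes afterwards.
frames = [(True, False), (False, True), (False, True), (True, True), (False, True)]
trace = run(frames)
print(trace)
```

A half-duplex system would ignore `user_is_speaking` while in the `SPEAKING` state, which is exactly what prevents it from handling interruptions.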

## Training Resources and Evaluation

#### 1. Training resources

**Datasets used in the various training stages**

| Stage | Task | Dataset | Size | URL | Modality |
| ----- | ---- | ------- | ---- | --- | -------- |
| Modal Alignment | Multilingual TTS | Emilia | 101k hrs | Link | Text, Speech |
| | Mandarin ASR | AISHELL-1 | 170 hrs | Link | Text, Speech |
| | Mandarin ASR | AISHELL-2 | 1k hrs | Link | Text, Speech |
| | Mandarin TTS | AISHELL-3 | 85 hrs, 88,035 utt., 218 spk. | Link | Text, Speech |
| | TTS | LibriTTS | 585 hrs | Link | Text, Speech |
| | ASR | TED-LIUM | 452 hrs | Link | Text, Speech |
| | ASR | VoxPopuli | 1.8k hrs | Link | Text, Speech |
| | ASR | Librispeech | 1,000 hrs | Link | Text, Speech |
| | ASR | MLS | 44.5k hrs | Link | Text, Speech |
| | TTS | Wenetspeech | 22.4k hrs | Link | Text, Speech |
| | ASR | Gigaspeech | 40k hrs | Link | Text, Speech |
| | ASR | VCTK | 300 hrs | Link | Text, Speech |
| | TTS | LJSpeech | 24 hrs | Link | Text, Speech |
| | ASR | Common Voice | 2,500 hrs | Link | Text, Speech |
| Dual-Stream Processing | Instruction | Alpaca | 52,000 items | Link | Text + TTS |
| | Instruction | Moss | - | Link | Text + TTS |
| | Instruction | BelleCN | - | Link | Text + TTS |
| | Dialogue | UltraChat | 1.5 million | Link | Text + TTS |
| | Instruction | Open-Orca | - | Link | Text + TTS |
| | Noise | DNS | 2425 hrs | Link | Noise data |
| | Noise | MUSAN | - | Link | Noise data |
| Conversation Fine-Tune | Dialogue | Fisher | 964 hrs | Link | Text, Speech |
| | Dialogue | GPT-Talker | - | Link | Text, Speech |
| | Instruction | INSTRUCTS2S-200K | 200k items | Link | Text + TTS |
| | Instruction | Open Hermes | 900k items | Link | Text + TTS |

Table 4: Datasets used in the various training stages


**Music and Non-Speech Sound Datasets**

| Dataset | Size | URL | Modality |
| ------- | ---- | --- | -------- |
| ESC-50 | 2,000 clips (5s each) | Link | Sound |
| UrbanSound8K | 8,732 clips (<=4s each) | Link | Sound |
| AudioSet | 2000k+ clips (10s each) | Link | Sound |
| TUT Acoustic Scenes 2017 | 52,630 segments | Link | Sound |
| Warblr | 10,000 clips | Link | Sound |
| FSD50K | 51,197 clips (total 108.3 hours) | Link | Sound |
| DCASE Challenge | varies annually | Link | Sound |
| IRMAS | 6,705 audio files (3s each) | Link | Music |
| FMA | 106,574 tracks | Link | Music |
| NSynth | 305,979 notes | Link | Music |
| EMOMusic | 744 songs | Link | Music |
| MedleyDB | 122 multitrack recordings | Link | Music |
| MagnaTagATune | 25,863 clips (30s each) | Link | Music |
| MUSDB | 150 songs | Link | Music |
| M4Singer | 700 songs | Link | Music |
| Jamendo | 600k songs | Link | Music |

Table 5: Music and Non-Speech Sound Datasets

#### 2. Evaluation

Evaluation is a crucial aspect of training and testing spoken dialogue models. In this section, we provide a comprehensive overview of evaluation across **11 aspects**. The evaluation metrics fall into **two main types**: **Basic Evaluation** and **Advanced Evaluation**.

Table 6: This table evaluates model performance across various abilities, common tasks, representative benchmarks, and corresponding metrics.
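
This README does not enumerate the individual metrics, so the following is only an example of what one such metric can look like. Assuming word error rate (WER), a standard measure for speech recognition output, is among the metrics of interest, it can be computed as the word-level edit distance divided by the reference length. The function name `word_error_rate` is chosen here for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER = {wer:.3f}")
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.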

## Cite

```bibtex
@article{ji2024wavchat,
  title={WavChat: A Survey of Spoken Dialogue Models},
  author={Ji, Shengpeng and Chen, Yifu and Fang, Minghui and Zuo, Jialong and Lu, Jingyu and Wang, Hanting and Jiang, Ziyue and Zhou, Long and Liu, Shujie and Cheng, Xize and others},
  journal={arXiv preprint arXiv:2411.13577},
  year={2024}
}
```