├── .tags
└── README.md

--------------------------------------------------------------------------------
/.tags:
--------------------------------------------------------------------------------
!_TAG_FILE_FORMAT	2	/extended format; --format=1 will not append ;" to lines/
!_TAG_FILE_SORTED	1	/0=unsorted, 1=sorted, 2=foldcase/
!_TAG_OUTPUT_FILESEP	slash	/slash or backslash/
!_TAG_OUTPUT_MODE	u-ctags	/u-ctags or e-ctags/
!_TAG_PATTERN_LENGTH_LIMIT	96	/0 for no limit/
!_TAG_PROGRAM_AUTHOR	Universal Ctags Team	//
!_TAG_PROGRAM_NAME	Universal Ctags	/Derived from Exuberant Ctags/
!_TAG_PROGRAM_URL	https://ctags.io/	/official site/
!_TAG_PROGRAM_VERSION	0.0.0	//
Audio Language Models	README.md	/^### Audio Language Models$/;"	S	line:30	section:Large-Audio-Models""Contents
Audio SSL and UL models	README.md	/^### Audio SSL and UL models$/;"	S	line:34	section:Large-Audio-Models""Contents
Contents	README.md	/^## Contents$/;"	s	line:5	chapter:Large-Audio-Models
Large-Audio-Models	README.md	/^# Large-Audio-Models$/;"	c	line:1
Prompt-based Audio Synthesis	README.md	/^### Prompt-based Audio Synthesis$/;"	S	line:11	section:Large-Audio-Models""Contents

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Large-Audio-Models

We keep track of large models in the audio domain, including speech, singing, music, and more.

## Contents

- [Spoken Language Models](#Spoken-Language-Models)
- [Prompt-based Audio Synthesis](#Prompt-based-Audio-Synthesis)
- [Audio Language Models](#Audio-Language-Models)
- [Audio SSL/UL models](#Audio-SSL-and-UL-models)

### Spoken Language Models

- **Moshi: a speech-text foundation model for real-time dialogue** (2024.9), Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour. [[PDF]](https://kyutai.org/Moshi.pdf) [[Code]](https://github.com/kyutai-labs/moshi)

- **LLaMA-Omni: Seamless Speech Interaction with Large Language Models** (2024.9), Qingkai Fang et al. [[PDF]](https://arxiv.org/pdf/2409.06666) [[Code]](https://github.com/ictnlp/LLaMA-Omni)

- **Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming** (2024.8), Zhifei Xie et al. [[PDF]](https://arxiv.org/pdf/2408.16725) [[Code]](https://github.com/gpt-omni/mini-omni)

- **SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities** (2023.5), Dong Zhang et al. [[PDF]](https://arxiv.org/pdf/2305.11000) [[Code]](https://github.com/0nutation/SpeechGPT/tree/main/speechgpt)

### Prompt-based Audio Synthesis

- **M2UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models** (2023), Atin Sakkeer Hussain et al. [[PDF]](https://arxiv.org/pdf/2311.11255.pdf)
- **SpeechX: Neural Codec Language Model as a Versatile Speech Transformer** (2023), Xiaofei Wang et al. [[PDF]](https://arxiv.org/pdf/2308.06873.pdf)
- **TANGO: Text-to-Audio Generation using Instruction Tuned LLM and Latent Diffusion Model** (2023), Deepanway Ghosal et al. [[PDF]](https://openreview.net/pdf?id=1Sn2WqLku1e)
- **Diverse and Vivid Sound Generation from Text Descriptions** (2023), Guangwei Li et al. [[PDF]](https://arxiv.org/pdf/2305.01980.pdf)
- **NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers** (2023), Kai Shen et al. [[PDF]](https://arxiv.org/pdf/2304.09116.pdf)
- **AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models** (2023), Yuancheng Wang et al. [[PDF]](https://arxiv.org/pdf/2304.00830.pdf)
- **Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos** (2023), Kun Su et al. [[PDF]](https://arxiv.org/pdf/2303.16897.pdf)
- **FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model** (2023), Ruiqing Xue et al. [[PDF]](https://arxiv.org/pdf/2303.02939v3.pdf)
- **VALL-E X: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling** (2023), Ziqiang Zhang et al. [[PDF]](https://arxiv.org/pdf/2303.03926.pdf)
- **Simple and Controllable Music Generation** (2023), Jade Copet et al. [[PDF]](https://arxiv.org/pdf/2306.05284.pdf)
- **Efficient Neural Music Generation** (2023), Max W. Y. Lam et al. [[PDF]](https://arxiv.org/pdf/2305.15719.pdf)
- **ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models** (2023), Pengfei Zhu et al. [[PDF]](https://arxiv.org/pdf/2302.04456.pdf)
- **Noise2Music: Text-conditioned Music Generation with Diffusion Models** (2023), Qingqing Huang et al. [[PDF]](https://arxiv.org/pdf/2302.03917)
- **Spear-TTS: Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision** (2023), Eugene Kharitonov et al. [[PDF]](https://arxiv.org/abs/2302.03540)
- **SingSong: Generating musical accompaniments from singing** (2023), Chris Donahue et al. [[PDF]](https://arxiv.org/pdf/2301.12662.pdf)
- **MusicLM: Generating Music From Text** (2023), Andrea Agostinelli et al. [[PDF]](https://arxiv.org/pdf/2301.11325)
- **InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt** (2023), Dongchao Yang et al. [[PDF]](https://arxiv.org/pdf/2301.13662.pdf)
- **Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation** (2023), Rongjie Huang et al. [[PDF]](https://arxiv.org/pdf/2305.18474.pdf)
- **AudioLDM: Text-to-Audio Generation with Latent Diffusion Models** (2023), Haohe Liu et al. [[PDF]](https://arxiv.org/pdf/2301.12503)
- **Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion** (2023), Flavio Schneider et al. [[PDF]](https://arxiv.org/pdf/2301.11757)
- **Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models** (2023), Jiawei Huang et al. [[PDF]](https://text-to-audio.github.io/paper.pdf)
- **ArchiSound: Audio Generation with Diffusion** (2023), Flavio Schneider. [[PDF]](https://arxiv.org/ftp/arxiv/papers/2301/2301.13267.pdf)
- **VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers** (2023), Chengyi Wang et al. [[PDF]](https://arxiv.org/pdf/2301.02111.pdf)
- **PromptTTS: Controllable Text-to-Speech with Text Descriptions** (2022), Zhifang Guo et al. [[PDF]](https://arxiv.org/pdf/2211.12171.pdf)
- **Diffsound: Discrete Diffusion Model for Text-to-sound Generation** (2022), Dongchao Yang et al. [[PDF]](https://arxiv.org/pdf/2207.09983v1.pdf)

### Audio Language Models

- **Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models** (2023), Yunfei Chu et al. [[PDF]](https://arxiv.org/pdf/2311.07919v1.pdf)
- **UniAudio: An Audio Foundation Model Toward Universal Audio Generation** (2023), Dongchao Yang et al. [[PDF]](https://arxiv.org/pdf/2310.00704.pdf)
- **SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models** (2023), Xin Zhang et al. [[PDF]](https://arxiv.org/pdf/2308.16692.pdf)
- **SoundStorm: Efficient Parallel Audio Generation** (2023), Zalán Borsos et al. [[PDF]](https://arxiv.org/pdf/2305.09636.pdf)
- **AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head** (2023), Rongjie Huang et al. [[PDF]](https://arxiv.org/pdf/2304.12995.pdf)
- **AudioPaLM: A Large Language Model That Can Speak and Listen** (2023), Paul K. Rubenstein et al. [[PDF]](https://arxiv.org/pdf/2306.12925.pdf)
- **Pengi: An Audio Language Model for Audio Tasks** (2023), Soham Deshmukh et al. [[PDF]](https://arxiv.org/pdf/2305.11834)
- **AudioLM: a Language Modeling Approach to Audio Generation** (2022), Zalán Borsos et al. [[PDF]](https://arxiv.org/pdf/2209.03143)

### Audio SSL and UL models

- **vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations** (2019), Alexei Baevski et al. [[PDF]](https://arxiv.org/pdf/1910.05453.pdf)
- **wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations** (2020), Alexei Baevski et al. [[PDF]](https://arxiv.org/pdf/2006.11477.pdf)
- **W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training** (2021), Yu-An Chung et al. [[PDF]](https://arxiv.org/pdf/2108.06209.pdf)
- **HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units** (2021), Wei-Ning Hsu et al. [[PDF]](https://arxiv.org/pdf/2106.07447.pdf)
- **Data2vec: A general framework for self-supervised learning in speech, vision and language** (2022), Alexei Baevski et al. [[PDF]](https://arxiv.org/pdf/2202.03555.pdf)
- **MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets** (2022), Ziyang Ma et al. [[PDF]](https://arxiv.org/pdf/2211.07321.pdf)
- **ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers** (2022), Kaizhi Qian et al. [[PDF]](https://arxiv.org/pdf/2204.09224.pdf)
- **Data2vec 2.0: Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language** (2022), Alexei Baevski et al. [[PDF]](https://arxiv.org/pdf/2212.07525.pdf)
- **MuLan: A Joint Embedding of Music Audio and Natural Language** (2022), Qingqing Huang et al. [[PDF]](https://arxiv.org/pdf/2208.12415.pdf)

--------------------------------------------------------------------------------