├── .tags
└── README.md

--------------------------------------------------------------------------------
/.tags:
--------------------------------------------------------------------------------
!_TAG_FILE_FORMAT	2	/extended format; --format=1 will not append ;" to lines/
!_TAG_FILE_SORTED	1	/0=unsorted, 1=sorted, 2=foldcase/
!_TAG_OUTPUT_FILESEP	slash	/slash or backslash/
!_TAG_OUTPUT_MODE	u-ctags	/u-ctags or e-ctags/
!_TAG_PATTERN_LENGTH_LIMIT	96	/0 for no limit/
!_TAG_PROGRAM_AUTHOR	Universal Ctags Team	//
!_TAG_PROGRAM_NAME	Universal Ctags	/Derived from Exuberant Ctags/
!_TAG_PROGRAM_URL	https://ctags.io/	/official site/
!_TAG_PROGRAM_VERSION	0.0.0	//
Audio Language Models	README.md	/^### Audio Language Models$/;"	S	line:30	section:Large-Audio-Models""Contents
Audio SSL and UL models	README.md	/^### Audio SSL and UL models$/;"	S	line:34	section:Large-Audio-Models""Contents
Contents	README.md	/^## Contents$/;"	s	line:5	chapter:Large-Audio-Models
Large-Audio-Models	README.md	/^# Large-Audio-Models$/;"	c	line:1
Prompt-based Audio Synthesis	README.md	/^### Prompt-based Audio Synthesis$/;"	S	line:11	section:Large-Audio-Models""Contents

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Large-Audio-Models

We keep track of large models in the audio domain, including speech, singing, music, and more.

## Contents

- [Spoken Language Models](#Spoken-Language-Models)
- [Prompt-based Audio Synthesis](#Prompt-based-Audio-Synthesis)
- [Audio Language Models](#Audio-Language-Models)
- [Audio SSL/UL models](#Audio-SSL-and-UL-models)

### Spoken Language Models

- **Moshi: a speech-text foundation model for real-time dialogue** (2024.9), Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour. [[PDF]](https://kyutai.org/Moshi.pdf) [[Code]](https://github.com/kyutai-labs/moshi)

- **LLaMA-Omni: Seamless Speech Interaction with Large Language Models** (2024.9), Qingkai Fang et al. [[PDF]](https://arxiv.org/pdf/2409.06666) [[Code]](https://github.com/ictnlp/LLaMA-Omni)

- **Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming** (2024.8), Zhifei Xie et al. [[PDF]](https://arxiv.org/pdf/2408.16725) [[Code]](https://github.com/gpt-omni/mini-omni)

- **SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities** (2023.5), Dong Zhang et al. [[PDF]](https://arxiv.org/pdf/2305.11000) [[Code]](https://github.com/0nutation/SpeechGPT/tree/main/speechgpt)

### Prompt-based Audio Synthesis

- **M2UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models** (2023), Atin Sakkeer Hussain et al. [[PDF]](https://arxiv.org/pdf/2311.11255.pdf)
- **SpeechX: Neural Codec Language Model as a Versatile Speech Transformer** (2023), Xiaofei Wang et al. [[PDF]](https://arxiv.org/pdf/2308.06873.pdf)
- **TANGO: Text-to-Audio Generation using Instruction Tuned LLM and Latent Diffusion Model** (2023), Deepanway Ghosal et al. [[PDF]](https://openreview.net/pdf?id=1Sn2WqLku1e)
- **Diverse and Vivid Sound Generation from Text Descriptions** (2023), Guangwei Li et al. [[PDF]](https://arxiv.org/pdf/2305.01980.pdf)
- **NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers** (2023), Kai Shen et al. [[PDF]](https://arxiv.org/pdf/2304.09116.pdf)
- **AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models** (2023), Yuancheng Wang et al. [[PDF]](https://arxiv.org/pdf/2304.00830.pdf)
- **Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos** (2023), Kun Su et al. [[PDF]](https://arxiv.org/pdf/2303.16897.pdf)
- **FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model** (2023), Ruiqing Xue et al. [[PDF]](https://arxiv.org/pdf/2303.02939v3.pdf)
- **VALL-E X: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling** (2023), Ziqiang Zhang et al. [[PDF]](https://arxiv.org/pdf/2303.03926.pdf)
- **Simple and Controllable Music Generation** (2023), Jade Copet et al. [[PDF]](https://arxiv.org/pdf/2306.05284.pdf)
- **Efficient Neural Music Generation** (2023), Max W. Y. Lam et al. [[PDF]](https://arxiv.org/pdf/2305.15719.pdf)
- **ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models** (2023), Pengfei Zhu et al. [[PDF]](https://arxiv.org/pdf/2302.04456.pdf)
- **Noise2Music: Text-conditioned Music Generation with Diffusion Models** (2023), Qingqing Huang et al. [[PDF]](https://arxiv.org/pdf/2302.03917)
- **Spear-TTS: Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision** (2023), Eugene Kharitonov et al. [[PDF]](https://arxiv.org/abs/2302.03540)
- **SingSong: Generating musical accompaniments from singing** (2023), Chris Donahue et al. [[PDF]](https://arxiv.org/pdf/2301.12662.pdf)
- **MusicLM: Generating Music From Text** (2023), Andrea Agostinelli et al. [[PDF]](https://arxiv.org/pdf/2301.11325)
- **InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt** (2023), Dongchao Yang et al. [[PDF]](https://arxiv.org/pdf/2301.13662.pdf)
- **Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation** (2023), Rongjie Huang et al. [[PDF]](https://arxiv.org/pdf/2305.18474.pdf)
- **AudioLDM: Text-to-Audio Generation with Latent Diffusion Models** (2023), Haohe Liu et al. [[PDF]](https://arxiv.org/pdf/2301.12503)
- **Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion** (2023), Flavio Schneider et al. [[PDF]](https://arxiv.org/pdf/2301.11757)
- **Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models** (2023), Jiawei Huang et al. [[PDF]](https://text-to-audio.github.io/paper.pdf)
- **ArchiSound: Audio Generation with Diffusion** (2023), Flavio Schneider. [[PDF]](https://arxiv.org/ftp/arxiv/papers/2301/2301.13267.pdf)
- **VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers** (2023), Chengyi Wang et al. [[PDF]](https://arxiv.org/pdf/2301.02111.pdf)
- **PromptTTS: Controllable Text-to-Speech with Text Descriptions** (2022), Zhifang Guo et al. [[PDF]](https://arxiv.org/pdf/2211.12171.pdf)
- **Diffsound: Discrete Diffusion Model for Text-to-sound Generation** (2022), Dongchao Yang et al. [[PDF]](https://arxiv.org/pdf/2207.09983v1.pdf)

### Audio Language Models

- **Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models** (2023), Yunfei Chu et al. [[PDF]](https://arxiv.org/pdf/2311.07919v1.pdf)
- **UniAudio: An Audio Foundation Model Toward Universal Audio Generation** (2023), Dongchao Yang et al. [[PDF]](https://arxiv.org/pdf/2310.00704.pdf)
- **SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models** (2023), Xin Zhang et al. [[PDF]](https://arxiv.org/pdf/2308.16692.pdf)
- **SoundStorm: Efficient Parallel Audio Generation** (2023), Zalán Borsos et al. [[PDF]](https://arxiv.org/pdf/2305.09636.pdf)
- **AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head** (2023), Rongjie Huang et al. [[PDF]](https://arxiv.org/pdf/2304.12995.pdf)
- **AudioPaLM: A Large Language Model That Can Speak and Listen** (2023), Paul K. Rubenstein et al. [[PDF]](https://arxiv.org/pdf/2306.12925.pdf)
- **Pengi: An Audio Language Model for Audio Tasks** (2023), Soham Deshmukh et al. [[PDF]](https://arxiv.org/pdf/2305.11834)
- **AudioLM: a Language Modeling Approach to Audio Generation** (2022), Zalán Borsos et al. [[PDF]](https://arxiv.org/pdf/2209.03143)

### Audio SSL and UL models

- **vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations** (2019), Alexei Baevski et al. [[PDF]](https://arxiv.org/pdf/1910.05453.pdf)
- **wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations** (2020), Alexei Baevski et al. [[PDF]](https://arxiv.org/pdf/2006.11477.pdf)
- **W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training** (2021), Yu-An Chung et al. [[PDF]](https://arxiv.org/pdf/2108.06209.pdf)
- **HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units** (2021), Wei-Ning Hsu et al. [[PDF]](https://arxiv.org/pdf/2106.07447.pdf)
- **Data2vec: A general framework for self-supervised learning in speech, vision and language** (2022), Alexei Baevski et al. [[PDF]](https://arxiv.org/pdf/2202.03555.pdf)
- **MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets** (2022), Ziyang Ma et al. [[PDF]](https://arxiv.org/pdf/2211.07321.pdf)
- **ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers** (2022), Kaizhi Qian et al. [[PDF]](https://arxiv.org/pdf/2204.09224.pdf)
- **Data2vec 2.0: Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language** (2022), Alexei Baevski et al. [[PDF]](https://arxiv.org/pdf/2212.07525.pdf)
- **MuLan: A Joint Embedding of Music Audio and Natural Language** (2022), Qingqing Huang et al. [[PDF]](https://arxiv.org/pdf/2208.12415.pdf)

--------------------------------------------------------------------------------