├── .gitignore └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | plantilla.txt 2 | diffusion-audio.code-workspace 3 | enlaces.txt 4 | temp.txt 5 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Audio models, applications and utilities based on diffusion 2 | ### (and some others I find interesting) 3 | 4 |
5 | 6 | * [ArchiSound](#ArchiSound) 7 | * [AudioLDM](#AudioLDM) 8 | * [AudioLM](#AudioLM) 9 | * [Bark](#Bark) 10 | * [BigVGAN](#BigVGAN) 11 | * [Dance Diffusion](#Dance-Diffusion) 12 | * [DiffWave](#DiffWave) 13 | * [Make-An-Audio](#Make-An-Audio) 14 | * [Moûsai](#Moûsai) 15 | * [Msanii](#Msanii) 16 | * [MusicLM](#MusicLM) 17 | * [Noise2Music](#Noise2Music) 18 | * [RAVE 2](#RAVE-2) 19 | * [Riffusion](#Riffusion) 20 | * [SingSong](#SingSong) 21 | * [Tango](#Tango) 22 | 23 |

24 | 25 | ##### 2023-04-29 26 | ## Tango 27 | #### 16kHz 28 | 29 | TANGO uses the instruction-tuned LLM FLAN-T5 as the text encoder for text-to-audio (TTA) generation, the task of generating audio from a textual description. Prior TTA work either pre-trained a joint text-audio encoder or used a non-instruction-tuned model such as T5. The resulting latent diffusion model (LDM) outperforms the state-of-the-art AudioLDM on most metrics of the AudioCaps test set and stays comparable on the rest, despite training the LDM on a 63 times smaller dataset and keeping the text encoder frozen. 30 | 31 | [Paper](https://arxiv.org/abs/2304.13731) 32 | – 33 | [Code](https://github.com/declare-lab/tango) 34 | – 35 | [Demo](https://huggingface.co/spaces/declare-lab/tango-text-to-audio-generation) 36 | – 37 | [Examples](https://tango-web.github.io/) 38 | 39 |
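A minimal generation sketch, assuming the `Tango` wrapper class from the linked repository and the `declare-lab/tango` checkpoint; the prompt, output path, and write step are illustrative:

```python
import soundfile as sf
from tango import Tango  # provided by the declare-lab/tango repository

# Load the released checkpoint from the Hugging Face Hub.
tango = Tango("declare-lab/tango")

# Generate a waveform (NumPy array) from a text prompt.
audio = tango.generate("An audience cheering and clapping")

# Save it; 16 kHz follows the repository's quickstart example.
sf.write("tango_sample.wav", audio, samplerate=16000)
```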

40 | 41 | ##### 2023-04-29 42 | ## Bark 43 | #### 24kHz 44 | 45 | Bark is a transformer-based text-to-audio model created by Suno. Bark can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects. The model can also produce nonverbal communications like laughing, sighing and crying. 46 | 47 | [Code](https://github.com/suno-ai/bark) 48 | – 49 | [Demo](https://colab.research.google.com/drive/1eJfA2XUa-mXwdMy7DoYKVYHI1iTd9Vkt?usp=sharing) 50 | – 51 | [Examples](https://suno-ai.notion.site/Bark-Examples-5edae8b02a604b54a42244ba45ebc2e2) 52 | 53 |
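A short usage sketch following the Bark repository's Python quickstart (the prompt and output file name are arbitrary):

```python
from scipy.io.wavfile import write as write_wav
from bark import SAMPLE_RATE, generate_audio, preload_models

# Download and cache the model weights.
preload_models()

# Synthesize speech, including a nonverbal cue in brackets.
audio_array = generate_audio("Hello, my name is Suno. [laughs] And I like pizza.")

# Save the waveform at Bark's native 24 kHz sample rate.
write_wav("bark_generation.wav", SAMPLE_RATE, audio_array)
```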

54 | 55 | ##### 2023-03-04 56 | ## BigVGAN 57 | #### 24kHz 58 | 59 | BigVGAN is a universal vocoder that generates high-quality audio for many different speakers and environments without fine-tuning. The model uses a periodic activation function and anti-aliased representations to improve audio quality and is trained on a large speech dataset. BigVGAN achieves state-of-the-art performance under various zero-shot (out-of-distribution) conditions such as unseen speakers, languages, recording environments, singing voices, music, and instrumental audio. 60 | 61 | [Paper](https://arxiv.org/abs/2206.04658) 62 | – 63 | [Code](https://github.com/NVIDIA/BigVGAN) 64 | – 65 | [Demo](https://huggingface.co/spaces/L0SG/BigVGAN) 66 | – 67 | [Examples](https://bigvgan-demo.github.io/) 68 | 69 |

70 | 71 | ##### 2023-01-31 72 | ## AudioLDM 73 | #### 16kHz 74 | 75 | AudioLDM is a text-to-audio system built on a latent space that learns continuous audio representations from contrastive language-audio pretraining (CLAP) latents and enables various text-guided audio manipulations. It offers both strong generation quality and computational efficiency, achieving state-of-the-art performance when trained on AudioCaps with a single GPU. 76 | 77 | [Paper](https://arxiv.org/abs/2301.12503) 78 | – 79 | [Code 1](https://huggingface.co/docs/diffusers/api/pipelines/audioldm), 80 | [Code 2](https://github.com/haoheliu/AudioLDM) 81 | – 82 | [Demo](https://huggingface.co/spaces/haoheliu/audioldm-text-to-audio-generation) 83 | – 84 | [Examples](https://audioldm.github.io/) 85 | 86 |
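A minimal text-to-audio sketch using the Diffusers `AudioLDMPipeline` linked above as Code 1 (the checkpoint id and prompt are examples; AudioLDM outputs 16 kHz audio):

```python
import scipy.io.wavfile
import torch
from diffusers import AudioLDMPipeline

# Load an AudioLDM checkpoint (example repo id) and move it to the GPU.
pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Generate five seconds of audio from a text prompt.
audio = pipe(
    "A hammer hitting a wooden surface",
    num_inference_steps=200,
    audio_length_in_s=5.0,
).audios[0]

# Save the result as a 16 kHz WAV file.
scipy.io.wavfile.write("audioldm_sample.wav", rate=16000, data=audio)
```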

87 | 88 | ##### 2023-01-30 89 | ## SingSong 90 | #### 44kHz 91 | 92 | SingSong is a system that generates instrumental music to accompany input vocals, building on recent developments in musical source separation and audio generation. Source separation is used to produce aligned pairs of vocals and instrumentals, and AudioLM is adapted for conditional audio-to-audio generation. Tests show a 53% improvement on isolated vocals, and listeners preferred instrumentals generated by SingSong over those from a retrieval baseline. 93 | 94 | [Paper](https://arxiv.org/abs/2301.12662) 95 | – 96 | [Examples](https://storage.googleapis.com/sing-song/index.html?s=35c) 97 | 98 |

99 | 100 | ##### 2023-01-30 101 | ## Moûsai 102 | #### 48kHz 103 | 104 | This work investigates the potential of text-conditional music generation using a cascading latent diffusion approach which can generate high-quality music at 48kHz from textual descriptions, while maintaining reasonable inference speed on a single consumer GPU. 105 | 106 | [Paper](https://arxiv.org/abs/2301.11757) 107 | – 108 | [Code](https://github.com/archinetai/audio-diffusion-pytorch) 109 | – 110 | [Examples](https://anonymous0.notion.site/anonymous0/Mo-sai-Text-to-Audio-with-Long-Context-Latent-Diffusion-b43dbc71caf94b5898f9e8de714ab5dc) 111 | 112 |

113 | 114 | ##### 2023-01-30 115 | ## ArchiSound 116 | #### 48kHz 117 | 118 | Diffusion models have become increasingly popular for image generation, sparking interest in applying them to audio. This work proposes a set of audio diffusion models that address aspects such as the temporal dimension, long-term structure, and overlapping sounds, designed to run in real time on a single consumer GPU. Open-source libraries are also provided to facilitate further research in the field. 119 | 120 | [Paper](https://arxiv.org/abs/2301.13267) 121 | – 122 | [Code](https://github.com/archinetai/audio-diffusion-pytorch) 123 | – 124 | [Examples](https://flavioschneider.notion.site/flavioschneider/Audio-Generation-with-Diffusion-c4f29f39048d4f03a23da13078a44cdb) 125 | 126 |
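A condensed sketch based on the `audio-diffusion-pytorch` README (the argument values are illustrative, and the library's API has changed across versions):

```python
import torch
from audio_diffusion_pytorch import DiffusionModel, UNetV0, VDiffusion, VSampler

# An unconditional waveform diffusion model (v-objective) over stereo audio.
model = DiffusionModel(
    net_t=UNetV0,  # U-Net backbone
    in_channels=2,  # stereo input/output
    channels=[8, 32, 64, 128, 256, 512, 512, 1024, 1024],
    factors=[1, 4, 4, 4, 2, 2, 2, 2, 2],  # down/upsampling factor per layer
    items=[1, 2, 2, 2, 2, 2, 2, 4, 4],  # repeated blocks per layer
    attentions=[0, 0, 0, 0, 0, 1, 1, 1, 1],  # attention on deeper layers only
    attention_heads=8,
    attention_features=64,
    diffusion_t=VDiffusion,  # diffusion objective
    sampler_t=VSampler,  # sampler used at inference time
)

# Training step: the model returns the diffusion loss for a batch of waveforms.
audio = torch.randn(1, 2, 2**18)  # [batch, channels, samples]
loss = model(audio)
loss.backward()

# Sampling: turn noise of the same shape into a new audio clip.
noise = torch.randn(1, 2, 2**18)
sample = model.sample(noise, num_steps=50)
```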

127 | 128 | 129 | ##### 2023-01-29 130 | ## Make-An-Audio 131 | #### 16kHz 132 | 133 | Make-An-Audio is a prompt-enhanced diffusion model for large-scale text-to-audio generation. It alleviates data scarcity with pseudo prompt enhancement and leverages a spectrogram autoencoder to predict self-supervised audio representations, achieving state-of-the-art results in objective and subjective evaluations. Its controllability via classifier-free guidance and its generalization to X-to-Audio ("No Modality Left Behind") are also demonstrated. 134 | 135 | [Paper](https://text-to-audio.github.io/paper.pdf) 136 | – 137 | [Examples](https://text-to-audio.github.io/) 138 | 139 |

140 | 141 | ##### 2023-01-28 142 | ## Msanii 143 | #### 44kHz 144 | 145 | Msanii is a novel diffusion-based model for efficiently synthesizing long-context, high-fidelity music. It is the first work to successfully employ diffusion models for synthesizing such long music samples at high sample rates. 146 | 147 | [Paper](https://arxiv.org/abs/2301.06468) 148 | – 149 | [Code](https://github.com/Kinyugo/msanii) 150 | – 151 | [Demo](https://huggingface.co/spaces/kinyugo/msanii) 152 | – 153 | [Examples](https://kinyugo.github.io/msanii-demo/) 154 | 155 |

156 | 157 | ##### 2023-01-28 158 | ## Noise2Music 159 | 160 | We introduce Noise2Music, in which a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts. Two diffusion models are trained and used in succession: a generator that produces an intermediate representation conditioned on text, and a cascader that produces high-fidelity audio from that representation. 161 | 162 | We explore two options for the intermediate representation, one using a spectrogram and the other using lower-fidelity audio. The generated audio faithfully reflects key elements of the text prompt such as genre, tempo, instruments, mood, and era, and goes further to ground fine-grained semantics of the prompt. 163 | 164 | [Examples](https://noise2music.github.io/) 165 | 166 |

167 | 168 | ##### 2023-01-27 169 | ## RAVE 2 170 | #### 48kHz 171 | 172 | This paper introduces a real-time Audio Variational autoEncoder (RAVE) for fast and high-quality audio waveform synthesis. We show that it is the first model able to generate 48kHz audio signals while running 20 times faster than real-time on a standard laptop CPU. Our novel two-stage training procedure and post-training analysis of the latent space allows direct control over reconstruction fidelity and representation compactness. We evaluate the quality of the synthesized audio using quantitative and qualitative experiments, showing its superiority over existing models. Finally, we demonstrate applications of our model for timbre transfer and signal compression. All of our source code and audio examples are publicly available. 173 | 174 | [Paper](https://arxiv.org/abs/2111.05011) 175 | – 176 | [Code](https://github.com/acids-ircam/RAVE) 177 | – 178 | [Demo](https://www.youtube.com/watch?v=jAIRf4nGgYI) 179 | – 180 | [Examples](https://rmfrancis.bandcamp.com/album/pedimos-un-mensaje) 181 | 182 |
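A sketch of offline use with an exported RAVE model: the project's export step produces a TorchScript file whose `encode`/`decode` methods map between waveforms and latent trajectories (the file name below is a placeholder):

```python
import torch

# Load a TorchScript model produced by RAVE's export step (placeholder path).
model = torch.jit.load("my_rave_model.ts").eval()

# One second of mono audio at 48 kHz, shaped [batch, channels, samples].
x = torch.randn(1, 1, 48000)

with torch.no_grad():
    z = model.encode(x)  # waveform -> compact latent trajectory
    y = model.decode(z)  # latent -> reconstructed waveform (timbre transfer when
                         # the model was trained on a different target instrument)
```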

183 | 184 | ##### 2023-01-27 185 | ## MusicLM 186 | #### 24kHz 187 | 188 | We introduce MusicLM, a model that generates high-fidelity music from text descriptions. We also release MusicCaps, a dataset of 5.5k music-text pairs with rich text descriptions provided by human experts. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description, and it can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. 189 | 190 | [Paper](https://arxiv.org/abs/2301.11325) 191 | – 192 | [3rd. party code](https://github.com/lucidrains/musiclm-pytorch) 193 | – 194 | [Examples](https://google-research.github.io/seanet/musiclm/examples/) 195 | 196 |

197 | 198 | ##### 2022-11-25 199 | ## Riffusion 200 | 201 | Riffusion is a library for real-time music and audio generation with Stable Diffusion: the image model is fine-tuned to generate spectrogram images from text prompts, which are then converted to audio clips. 202 | 203 | [About](https://www.riffusion.com/about) 204 | – 205 | [Code](https://github.com/riffusion/riffusion) 206 | – 207 | [Demo](https://www.riffusion.com/) 208 | 209 |
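A rough sketch of the underlying idea: a fine-tuned Stable Diffusion checkpoint generates spectrogram images from text, which the riffusion package then converts back to audio (only the image-generation half is shown, since the audio inversion uses the repository's own spectrogram utilities):

```python
from diffusers import StableDiffusionPipeline

# Riffusion's checkpoint is a Stable Diffusion model fine-tuned on spectrogram images.
pipe = StableDiffusionPipeline.from_pretrained("riffusion/riffusion-model-v1").to("cuda")

# Generate a spectrogram image from a text prompt.
image = pipe("funk bassline with a jazzy saxophone").images[0]
image.save("spectrogram.png")

# The riffusion repository provides utilities to invert such images to audio
# (magnitude reconstruction from the image followed by phase estimation).
```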

210 | 211 | ##### 2022-09-07 212 | ## AudioLM 213 | 214 | AudioLM is a framework for high-quality audio generation with long-term consistency. It maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task. A hybrid tokenization scheme is proposed to achieve both reconstruction quality and long-term structure. AudioLM has been used to generate speech continuations while maintaining speaker identity and prosody, as well as piano music continuations without any symbolic representation of music. 215 | 216 | [Paper](https://arxiv.org/abs/2209.03143) 217 | – 218 | [3rd. party code](https://github.com/lucidrains/audiolm-pytorch) 219 | – 220 | [Examples](https://google-research.github.io/seanet/audiolm/examples/) 221 | – 222 | [Extra info](https://ai.googleblog.com/2022/10/audiolm-language-modeling-approach-to.html) 223 | 224 |

225 | 226 | ##### 2022-07-22 227 | ## Dance Diffusion 228 | #### 16-44kHz 229 | 230 | [Code](https://github.com/harmonai-org/sample-generator) 231 | – 232 | [Demo](https://colab.research.google.com/github/harmonai-org/sample-generator/blob/main/Dance_Diffusion.ipynb) 233 | 234 |

235 | 236 | ##### 2020-09-21 237 | ## DiffWave 238 | #### 22kHz 239 | 240 | This paper introduces DiffWave, a non-autoregressive, diffusion probabilistic model for waveform generation. It is more efficient than WaveNet vocoders and outperforms autoregressive and GAN-based models in terms of audio quality and sample diversity. 241 | 242 | [Paper](https://arxiv.org/abs/2009.09761) 243 | – 244 | [Code](https://github.com/lmnt-com/diffwave) 245 | – 246 | [3rd. party code](https://github.com/philsyn/DiffWave-Vocoder) 247 | – 248 | [Examples](https://diffwave-demo.github.io/) 249 | --------------------------------------------------------------------------------