# All About Speech
This repository organizes papers, learning materials, and code for understanding speech. There is another repository for machine/deep learning [here](https://github.com/jinny1208/All-Resources-Related-to-ML-DL).

## To Dos:
- organize stars
- add more papers
- papers to read:
  1. Speech-T: Transducer for TTS and Beyond

## TTS
* TTS
  - DC-TTS [[paper]] [[pytorch]](https://github.com/chaiyujin/dctts-pytorch) [[tensorflow]](https://github.com/Kyubyong/dc_tts)
  - Microsoft's LightSpeech [[paper]] [[code]](https://github.com/microsoft/NeuralSpeech)
  - SpeechFormer [[paper]] [[code]](https://github.com/HappyColor/SpeechFormer)
  - Non-Attentive Tacotron [[paper]] [[pytorch]](https://github.com/JoungheeKim/Non-Attentive-Tacotron)
  - Parallel Tacotron 2 [[paper]] [[code]](https://github.com/keonlee9420/Parallel-Tacotron2)
  - FCL-taco2: Fast, Controllable and Lightweight version of Tacotron2 [[paper]] [[code]](https://github.com/Wendison/FCL-taco2)
  - Transformer TTS: Neural Speech Synthesis with Transformer Network [[paper]] [[code]](https://github.com/soobinseo/Transformer-TTS)
  - VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech [[paper]] [[code]](https://github.com/jaywalnut310/vits)
  - Reformer-TTS (adaptation of Reformer to TTS) [[code]](https://github.com/kowaalczyk/reformer-tts)

* Prompt-based TTS (see [[link]](https://github.com/liusongxiang/Large-Audio-Models))

* Voice Conversion / Voice Cloning / Speaker Embedding (a speaker-embedding sketch follows this list)
  - StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks [[paper]] [[code]](https://github.com/liusongxiang/StarGAN-Voice-Conversion)
  - Neural Voice Cloning with Few Audio Samples (Baidu) [[paper]] [[code]](https://github.com/VisionBrain/Neural_Voice_Cloning)
  - Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques [[paper]] [[code]](https://github.com/mindslab-ai/assem-vc)
  - Unet-TTS: Improving Unseen Speaker and Style Transfer in One-Shot Voice Cloning [[paper]](https://arxiv.org/abs/2109.11115) [[code]](https://github.com/CMsmartvoice/One-Shot-Voice-Cloning)
  - FragmentVC: Any-to-any voice conversion by end-to-end extracting and fusing fine-grained voice fragments with attention [[paper]] [[code]](https://github.com/yistLin/FragmentVC)
  - VectorQuantizedCPC: Vector-Quantized Contrastive Predictive Coding for Acoustic Unit Discovery and Voice Conversion [[paper]] [[code]](https://github.com/bshall/VectorQuantizedCPC)
  - Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data [[paper]] [[code]](https://github.com/mindslab-ai/cotatron)
  - AGAIN-VC: A One-shot Voice Conversion using Activation Guidance and Adaptive Instance Normalization [[paper]] [[code]](https://github.com/KimythAnly/AGAIN-VC)
  - AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss [[paper]] [[code]](https://github.com/auspicious3000/autovc)
  - SC-GlowTTS: An Efficient Zero-Shot Multi-Speaker Text-To-Speech Model [[code]](https://github.com/Edresson/SC-GlowTTS)
  - Deep Speaker: An End-to-End Neural Speaker Embedding System [[paper]] [[code]](https://github.com/philipperemy/deep-speaker)
  - VQMIVC: One-shot (any-to-any) Voice Conversion [[paper]] [[code]](https://github.com/Wendison/VQMIVC)
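Since several of the voice-cloning entries above condition on a speaker embedding, here is a minimal sketch of extracting one with SpeechBrain (listed under Toolkits below). The pretrained model name and exact API are assumptions based on SpeechBrain's published recipes and may differ across versions.

```python
# Minimal sketch: extract an ECAPA-TDNN speaker embedding with SpeechBrain.
# Assumes the "speechbrain/spkrec-ecapa-voxceleb" pretrained model and a
# 16 kHz mono file "speaker.wav" (both placeholders for illustration).
import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")
signal, sample_rate = torchaudio.load("speaker.wav")
embedding = classifier.encode_batch(signal)  # shape: (1, 1, 192)
```

Zero-shot systems such as SC-GlowTTS condition the decoder on embeddings of this kind instead of a fixed speaker ID.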

* Style (Emotion, Prosody)
  - SMART-TTS Single Emotional TTS [[code]](https://github.com/SMART-TTS/SMART-Single_Emotional_TTS)
  - AutoPST: Global Rhythm Style Transfer Without Text Transcriptions [[paper]] [[code]](https://github.com/auspicious3000/AutoPST)
  - Transforming spectrum and prosody for emotional voice conversion with non-parallel training data [[paper]] [[code]](https://github.com/KunZhou9646/emotional-voice-conversion-with-CycleGAN-and-CWT-for-Spectrum-and-F0)
  - Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency [[paper]] [[code]](https://github.com/entn-at/acc-tacotron2)
  - Learning Latent Representations for Style Control and Transfer in End-to-end Speech Synthesis (Tacotron-VAE) [[paper]] [[code]](https://github.com/jinhan/tacotron2-vae)
  - Time Domain Neural Audio Style Transfer (NIPS 2017) [[paper]] [[code]](https://github.com/pkmital/time-domain-neural-audio-style-transfer)
  - Meta-StyleSpeech and StyleSpeech [[paper]] [[code]](https://github.com/KevinMIN95/StyleSpeech)
  - Cross-Speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-to-Speech [[paper]] [[code]](https://github.com/keonlee9420/Cross-Speaker-Emotion-Transfer)

* Cross-lingual
  - End-to-End Code-switching TTS with Cross-Lingual Language Model
    - Mandarin and English
    - cross-lingual and multi-speaker
    - baseline: "Building a mixed-lingual neural TTS system with only monolingual data"
  - Building a mixed-lingual neural TTS system with only monolingual data
  - Transfer Learning, Style Control, and Speaker Reconstruction Loss for Zero-Shot Multilingual Multi-Speaker Text-to-Speech on Low-Resource Languages
    - has many good references
  - Exploring Disentanglement with Multilingual and Monolingual VQ-VAE [[paper]](https://arxiv.org/pdf/2105.01573.pdf) [[code]](https://github.com/rhoposit/multilingual_VQVAE)

* Music Related
  - Learning the Beauty in Songs: Neural Singing Voice Beautifier (ACL 2022) [[paper]] [[code]](https://github.com/MoonInTheRiver/NeuralSVB)
  - Speech to Singing (Interspeech 2020) [[paper]] [[code]](https://github.com/ericwudayi/speech2singing)
  - DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (AAAI 2022) [[paper]] [[code]](https://github.com/MoonInTheRiver/DiffSinger)
  - A Universal Music Translation Network (ICLR 2019)
  - Jukebox: A Generative Model for Music (OpenAI) [[paper]](https://arxiv.org/pdf/2005.00341.pdf) [[code]](https://github.com/openai/jukebox/tree/master/jukebox)

* Toolkits
  - IMS Toucan Speech Synthesis Toolkit [[paper]](http://festvox.org/blizzard/bc2021/BC21_IMS.pdf) [[code]](https://github.com/DigitalPhonetics/IMS-Toucan)
  - CREPE pitch tracker (torchcrepe, a PyTorch port) [[code]](https://github.com/maxrmorrison/torchcrepe)
  - SpeechBrain: useful tools to facilitate speech research [[code]](https://github.com/speechbrain/speechbrain)

* Vocoders

* Attention
  - Local attention [[code]](https://github.com/lucidrains/local-attention) (usage sketch below)
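As a quick orientation to the local-attention repository linked above, a minimal usage sketch follows. The constructor arguments mirror the repository's README, but treat the exact signature as an assumption that may change between releases.

```python
# Minimal sketch of lucidrains/local-attention: each query attends only to
# a window of nearby keys, keeping memory roughly linear in sequence length
# (useful for long spectrogram or waveform frame sequences).
import torch
from local_attention import LocalAttention

attn = LocalAttention(
    dim=64,           # per-head dimension
    window_size=512,  # each position attends within a 512-frame window
    causal=True,      # mask out future frames
)

# (batch, heads, sequence length, head dim)
q = torch.randn(2, 8, 2048, 64)
k = torch.randn(2, 8, 2048, 64)
v = torch.randn(2, 8, 2048, 64)

out = attn(q, k, v)  # same shape as q
```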
## ASR
- Towards End-to-End Spoken Language Understanding

## Speech Classification, Detection, Filter, etc.
- HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection [[paper]] [[code]](https://github.com/RetroCirce/HTS-Audio-Transformer)
- Google AI's VoiceFilter System [[paper]] [[code]](https://github.com/mindslab-ai/voicefilter)
- Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning (Interspeech 2019) [[paper]] [[code]](https://github.com/KrishnaDN/speech-emotion-recognition-using-self-attention)
- Multimodal Emotion Recognition with Transformer-Based Self-Supervised Feature Fusion [[paper]] [[code]](https://github.com/shamanez/Self-Supervised-Embedding-Fusion-Transformer)
- Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings (Interspeech 2021) [[paper]] [[code]](https://github.com/habla-liaa/ser-with-w2v2)
- Exploring Wav2vec 2.0 fine-tuning for improved speech emotion recognition [[paper]] [[code]](https://github.com/b04901014/FT-w2v2-ser)
- Rethinking CNN Models for Audio Classification [[paper]] [[code]](https://github.com/kamalesh0406/Audio-Classification)
- EEG-based emotion recognition using SincNet [[paper]] [[code]](https://github.com/meiyor/SincNet-for-Autism-EEG-based-Emotion-Recognition)

## Speaker Verification
- Cross attentive pooling for speaker verification (IEEE SLT 2021) [[paper]] [[code]](https://github.com/seongmin-kye/CAP)

## Linguistics


--------------------------


## Datasets
1. VGGSound: A Large-scale Audio-Visual Dataset [[paper]] [[code]](https://github.com/hche11/VGGSound)
2. CSS10: A collection of single-speaker speech datasets for 10 languages [[code]](https://github.com/Kyubyong/css10)
3. IEMOCAP: 12 hours of audiovisual data from 10 actors (5 male, 5 female) [[website](https://sail.usc.edu/iemocap/iemocap_release.htm)]
4. VoxCeleb [[repo]](https://github.com/clovaai/voxceleb_trainer)

## Data Augmentation
1. torch-audiomentations (fast audio data augmentation in PyTorch) [[code]](https://github.com/asteroid-team/torch-audiomentations)

## Aligners
1. Montreal Forced Aligner
   - For Korean [[link](https://chldkato.tistory.com/195)]

## Data (Pre)processing
1. Korean pronunciation and romanization based on the Wiktionary ko-pron Lua module [[code]](https://github.com/kord123/ko_pron)
2. Audio Signal Processing [[code]](https://github.com/sooftware/Audio-Signal-Processing)
3. Phonological Features (for the paper "Phonological features for 0-shot multilingual speech synthesis") [[paper]] [[code]](https://github.com/papercup-open-source/phonological-features)
4. SMART-G2P (converts English and hanja expressions in Korean sentences into Korean pronunciation) [[code]](https://github.com/SMART-TTS/SMART-G2P)
5. Kakao's grapheme-to-phoneme conversion package for Mandarin [[code]](https://github.com/kakaobrain/g2pM)
6. Webaverse Speech Tool [[code]](https://github.com/webaverse/LJSpeechTools)

## Verification
1. MCD [[repo]](https://github.com/jasminsternkopf/mel_cepstral_distance/tree/main)
   - The code works, but I am not sure it is correct: the MCD numbers come out a bit too high even for pairs of similar audios. A rough sanity-check sketch follows below.
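To sanity-check numbers from the repository above, here is a rough, self-contained MCD approximation. It uses librosa MFCCs as a stand-in for SPTK/WORLD mel-cepstra and naive truncation instead of DTW alignment, so its values will not match published MCD figures exactly; treat it as an assumption-laden sketch, not a reference implementation.

```python
# Rough mel-cepstral-distortion (MCD) sanity check.
# Per frame: MCD = (10 / ln 10) * sqrt(2 * sum_d (c_ref_d - c_syn_d)^2),
# then averaged over frames.
import numpy as np
import librosa

def rough_mcd(ref_path: str, syn_path: str, sr: int = 22050, n_mfcc: int = 13) -> float:
    ref, _ = librosa.load(ref_path, sr=sr)
    syn, _ = librosa.load(syn_path, sr=sr)
    # Frames x coefficients; drop c0 (overall energy), as is conventional for MCD.
    c_ref = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=n_mfcc).T[:, 1:]
    c_syn = librosa.feature.mfcc(y=syn, sr=sr, n_mfcc=n_mfcc).T[:, 1:]
    # Naive alignment: truncate to the shorter utterance (proper MCD uses DTW).
    n = min(len(c_ref), len(c_syn))
    diff = c_ref[:n] - c_syn[:n]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())
```

If two near-identical recordings still score high here, frame misalignment is the usual culprit, which matches the note above.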

## Other Research That May Help
* Text to Image Synthesis
  - DALL-E 2 [[code]](https://github.com/lucidrains/DALLE2-pytorch)
* AudioMAE (Masked Autoencoders that Listen) [[code]](https://github.com/rishikksh20/AudioMAE-pytorch)

--------------------------

## Organizations
* DeepMind [[repo]](https://github.com/deepmind/deepmind-research)
* OpenAI [[repo]](https://github.com/openai)
* Clubhouse: WeeklyArxivTalk [[repo]](https://github.com/jungwoo-ha/WeeklyArxivTalk)

## Other Repositories to Refer to - Speech Included/Related
* Speech Researchers List [[repo]](https://github.com/mutiann/speech_rankings)
* Jackson-Kang [[repo]](https://github.com/Jackson-Kang/Awesome-DL-based-Text-to-speech-Papers-and-Resources)
* Rosinality's ML Papers [[repo]](https://github.com/rosinality/ml-papers)
* ivallesp's Papers [[repo]](https://github.com/ivallesp/papers)
* ddlBoJack's Speech Pretraining [[repo]](https://github.com/ddlBoJack/Awesome-Speech-Pretraining)
* fuzhenxin's Style Transfer in Text [[repo]](https://github.com/fuzhenxin/Style-Transfer-in-Text)

## Learning Materials
1. Digital Signal Processing Lecture [[link]](https://github.com/spatialaudio/digital-signal-processing-lecture)
2. Ratsgo's Speechbook [[link]](https://github.com/ratsgo/speechbook)
3. YSDA Course in Speech Processing [[code]](https://github.com/yandexdataschool/speech_course)
4. NHN Forward YouTube video [[link]](https://www.youtube.com/watch?v=UGJMnRwL-mw)