# All About Speech
This repository organizes papers, learning materials, and code for understanding speech. There is another repository for machine/deep learning [here](https://github.com/jinny1208/All-Resources-Related-to-ML-DL).

## To Dos:
- organize stars
- add more papers
- papers to read:
  1. Speech-T: Transducer for TTS and Beyond

## TTS
* TTS
  - DC-TTS [[paper]] [[pytorch]](https://github.com/chaiyujin/dctts-pytorch) [[tensorflow]](https://github.com/Kyubyong/dc_tts)
  - Microsoft's LightSpeech [[paper]] [[code]](https://github.com/microsoft/NeuralSpeech)
  - SpeechFormer [[paper]] [[code]](https://github.com/HappyColor/SpeechFormer)
  - Non-Attentive Tacotron [[paper]] [[pytorch]](https://github.com/JoungheeKim/Non-Attentive-Tacotron)
  - Parallel Tacotron 2 [[paper]] [[code]](https://github.com/keonlee9420/Parallel-Tacotron2)
  - FCL-taco2: Fast, Controllable and Lightweight version of Tacotron2 [[paper]] [[code]](https://github.com/Wendison/FCL-taco2)
  - Transformer TTS: Neural Speech Synthesis with Transformer Network [[paper]] [[code]](https://github.com/soobinseo/Transformer-TTS)
  - VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech [[paper]] [[code]](https://github.com/jaywalnut310/vits)
  - Reformer-TTS (adaptation of Reformer to TTS) [[code]](https://github.com/kowaalczyk/reformer-tts)

* Prompt-based TTS (see [[link]](https://github.com/liusongxiang/Large-Audio-Models))

* Voice Conversion / Voice Cloning / Speaker Embedding (a speaker-embedding sketch follows this list)
  - StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks [[paper]] [[code]](https://github.com/liusongxiang/StarGAN-Voice-Conversion)
  - Neural Voice Cloning with Few Audio Samples (Baidu) [[paper]] [[code]](https://github.com/VisionBrain/Neural_Voice_Cloning)
  - Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques [[paper]] [[code]](https://github.com/mindslab-ai/assem-vc)
  - Unet-TTS: Improving Unseen Speaker and Style Transfer in One-Shot Voice Cloning [[paper]](https://arxiv.org/abs/2109.11115) [[code]](https://github.com/CMsmartvoice/One-Shot-Voice-Cloning)
  - FragmentVC: Any-to-any voice conversion by end-to-end extracting and fusing fine-grained voice fragments with attention [[paper]] [[code]](https://github.com/yistLin/FragmentVC)
  - VectorQuantizedCPC: Vector-Quantized Contrastive Predictive Coding for Acoustic Unit Discovery and Voice Conversion [[paper]] [[code]](https://github.com/bshall/VectorQuantizedCPC)
  - Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data [[paper]] [[code]](https://github.com/mindslab-ai/cotatron)
  - AGAIN-VC: A One-shot Voice Conversion using Activation Guidance and Adaptive Instance Normalization [[paper]] [[code]](https://github.com/KimythAnly/AGAIN-VC)
  - AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss [[paper]] [[code]](https://github.com/auspicious3000/autovc)
  - SC-GlowTTS: An Efficient Zero-Shot Multi-Speaker Text-To-Speech Model [[code]](https://github.com/Edresson/SC-GlowTTS)
  - Deep Speaker: An End-to-End Neural Speaker Embedding System [[paper]] [[code]](https://github.com/philipperemy/deep-speaker)
  - VQMIVC: One-shot (any-to-any) Voice Conversion [[paper]] [[code]](https://github.com/Wendison/VQMIVC)
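Since several of the voice-cloning entries above condition on a speaker embedding, here is a minimal sketch of extracting one with SpeechBrain (listed under Toolkits below). The pretrained model name and exact API are assumptions based on SpeechBrain's published recipes and may differ across versions.

```python
# Minimal sketch: extract an ECAPA-TDNN speaker embedding with SpeechBrain.
# Assumes the "speechbrain/spkrec-ecapa-voxceleb" pretrained model and a
# 16 kHz mono file "speaker.wav" (both placeholders for illustration).
import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")
signal, sample_rate = torchaudio.load("speaker.wav")
embedding = classifier.encode_batch(signal)  # shape: (1, 1, 192)
```

Zero-shot systems such as SC-GlowTTS condition the decoder on embeddings of this kind instead of a fixed speaker ID.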

* Style (Emotion, Prosody)
  - SMART-TTS Single Emotional TTS [[code]](https://github.com/SMART-TTS/SMART-Single_Emotional_TTS)
  - AutoPST: Global Rhythm Style Transfer Without Text Transcriptions [[paper]] [[code]](https://github.com/auspicious3000/AutoPST)
  - Transforming spectrum and prosody for emotional voice conversion with non-parallel training data [[paper]] [[code]](https://github.com/KunZhou9646/emotional-voice-conversion-with-CycleGAN-and-CWT-for-Spectrum-and-F0)
  - Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency [[paper]] [[code]](https://github.com/entn-at/acc-tacotron2)
  - Learning Latent Representations for Style Control and Transfer in End-to-end Speech Synthesis (Tacotron-VAE) [[paper]] [[code]](https://github.com/jinhan/tacotron2-vae)
  - Time Domain Neural Audio Style Transfer (NIPS 2017) [[paper]] [[code]](https://github.com/pkmital/time-domain-neural-audio-style-transfer)
  - Meta-StyleSpeech and StyleSpeech [[paper]] [[code]](https://github.com/KevinMIN95/StyleSpeech)
  - Cross-Speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-to-Speech [[paper]] [[code]](https://github.com/keonlee9420/Cross-Speaker-Emotion-Transfer)

* Cross-lingual
  - End-to-End Code-switching TTS with Cross-Lingual Language Model
    - Mandarin and English
    - cross-lingual and multi-speaker
    - baseline: "Building a mixed-lingual neural TTS system with only monolingual data"
  - Building a mixed-lingual neural TTS system with only monolingual data
  - Transfer Learning, Style Control, and Speaker Reconstruction Loss for Zero-Shot Multilingual Multi-Speaker Text-to-Speech on Low-Resource Languages
    - has many good references
  - Exploring Disentanglement with Multilingual and Monolingual VQ-VAE [[paper]](https://arxiv.org/pdf/2105.01573.pdf) [[code]](https://github.com/rhoposit/multilingual_VQVAE)

* Music Related
  - Learning the Beauty in Songs: Neural Singing Voice Beautifier (ACL 2022) [[paper]] [[code]](https://github.com/MoonInTheRiver/NeuralSVB)
  - Speech to Singing (Interspeech 2020) [[paper]] [[code]](https://github.com/ericwudayi/speech2singing)
  - DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (AAAI 2022) [[paper]] [[code]](https://github.com/MoonInTheRiver/DiffSinger)
  - A Universal Music Translation Network (ICLR 2019)
  - Jukebox: A Generative Model for Music (OpenAI) [[paper]](https://arxiv.org/pdf/2005.00341.pdf) [[code]](https://github.com/openai/jukebox/tree/master/jukebox)

* Toolkits
  - IMS Toucan Speech Synthesis Toolkit [[paper]](http://festvox.org/blizzard/bc2021/BC21_IMS.pdf) [[code]](https://github.com/DigitalPhonetics/IMS-Toucan)
  - CREPE pitch tracker (torchcrepe, a PyTorch port) [[code]](https://github.com/maxrmorrison/torchcrepe)
  - SpeechBrain: useful tools to facilitate speech research [[code]](https://github.com/speechbrain/speechbrain)

* Vocoders

* Attention
  - Local attention [[code]](https://github.com/lucidrains/local-attention) (usage sketch below)
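As a quick orientation to the local-attention repository linked above, a minimal usage sketch follows. The constructor arguments mirror the repository's README, but treat the exact signature as an assumption that may change between releases.

```python
# Minimal sketch of lucidrains/local-attention: each query attends only to
# a window of nearby keys, keeping memory roughly linear in sequence length
# (useful for long spectrogram or waveform frame sequences).
import torch
from local_attention import LocalAttention

attn = LocalAttention(
    dim=64,           # per-head dimension
    window_size=512,  # each position attends within a 512-frame window
    causal=True,      # mask out future frames
)

# (batch, heads, sequence length, head dim)
q = torch.randn(2, 8, 2048, 64)
k = torch.randn(2, 8, 2048, 64)
v = torch.randn(2, 8, 2048, 64)

out = attn(q, k, v)  # same shape as q
```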
## ASR
- Towards End-to-End Spoken Language Understanding

## Speech Classification, Detection, Filter, etc.
- HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection [[paper]] [[code]](https://github.com/RetroCirce/HTS-Audio-Transformer)
- Google AI's VoiceFilter System [[paper]] [[code]](https://github.com/mindslab-ai/voicefilter)
- Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning (Interspeech 2019) [[paper]] [[code]](https://github.com/KrishnaDN/speech-emotion-recognition-using-self-attention)
- Multimodal Emotion Recognition with Transformer-Based Self-Supervised Feature Fusion [[paper]] [[code]](https://github.com/shamanez/Self-Supervised-Embedding-Fusion-Transformer)
- Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings (Interspeech 2021) [[paper]] [[code]](https://github.com/habla-liaa/ser-with-w2v2)
- Exploring Wav2vec 2.0 fine-tuning for improved speech emotion recognition [[paper]] [[code]](https://github.com/b04901014/FT-w2v2-ser)
- Rethinking CNN Models for Audio Classification [[paper]] [[code]](https://github.com/kamalesh0406/Audio-Classification)
- EEG-based emotion recognition using SincNet [[paper]] [[code]](https://github.com/meiyor/SincNet-for-Autism-EEG-based-Emotion-Recognition)

## Speaker Verification
- Cross attentive pooling for speaker verification (IEEE SLT 2021) [[paper]] [[code]](https://github.com/seongmin-kye/CAP)

## Linguistics


--------------------------


## Datasets
1. VGGSound: A Large-scale Audio-Visual Dataset [[paper]] [[code]](https://github.com/hche11/VGGSound)
2. CSS10: A collection of single-speaker speech datasets for 10 languages [[code]](https://github.com/Kyubyong/css10)
3. IEMOCAP: 12 hours of audiovisual data from 10 actors (5 male, 5 female) [[website](https://sail.usc.edu/iemocap/iemocap_release.htm)]
4. VoxCeleb [[repo]](https://github.com/clovaai/voxceleb_trainer)

## Data Augmentation
1. torch-audiomentations (fast audio data augmentation in PyTorch) [[code]](https://github.com/asteroid-team/torch-audiomentations)

## Aligners
1. Montreal Forced Aligner
   - For Korean [[link](https://chldkato.tistory.com/195)]

## Data (Pre)processing
1. Korean pronunciation and romanization based on the Wiktionary ko-pron Lua module [[code]](https://github.com/kord123/ko_pron)
2. Audio Signal Processing [[code]](https://github.com/sooftware/Audio-Signal-Processing)
3. Phonological Features (for the paper "Phonological features for 0-shot multilingual speech synthesis") [[paper]] [[code]](https://github.com/papercup-open-source/phonological-features)
4. SMART-G2P (converts English and hanja expressions in Korean sentences into Korean pronunciation) [[code]](https://github.com/SMART-TTS/SMART-G2P)
5. Kakao's grapheme-to-phoneme conversion package for Mandarin [[code]](https://github.com/kakaobrain/g2pM)
6. Webaverse Speech Tool [[code]](https://github.com/webaverse/LJSpeechTools)

## Verification
1. MCD [[repo]](https://github.com/jasminsternkopf/mel_cepstral_distance/tree/main)
   - The code works, but I am not sure it is correct: the MCD numbers come out a bit too high even for pairs of similar audios. A rough sanity-check sketch follows below.
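To sanity-check numbers from the repository above, here is a rough, self-contained MCD approximation. It uses librosa MFCCs as a stand-in for SPTK/WORLD mel-cepstra and naive truncation instead of DTW alignment, so its values will not match published MCD figures exactly; treat it as an assumption-laden sketch, not a reference implementation.

```python
# Rough mel-cepstral-distortion (MCD) sanity check.
# Per frame: MCD = (10 / ln 10) * sqrt(2 * sum_d (c_ref_d - c_syn_d)^2),
# then averaged over frames.
import numpy as np
import librosa

def rough_mcd(ref_path: str, syn_path: str, sr: int = 22050, n_mfcc: int = 13) -> float:
    ref, _ = librosa.load(ref_path, sr=sr)
    syn, _ = librosa.load(syn_path, sr=sr)
    # Frames x coefficients; drop c0 (overall energy), as is conventional for MCD.
    c_ref = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=n_mfcc).T[:, 1:]
    c_syn = librosa.feature.mfcc(y=syn, sr=sr, n_mfcc=n_mfcc).T[:, 1:]
    # Naive alignment: truncate to the shorter utterance (proper MCD uses DTW).
    n = min(len(c_ref), len(c_syn))
    diff = c_ref[:n] - c_syn[:n]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())
```

If two near-identical recordings still score high here, frame misalignment is the usual culprit, which matches the note above.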

## Other Research That May Help
* Text to Image Synthesis
  - DALL-E 2 [[code]](https://github.com/lucidrains/DALLE2-pytorch)
* AudioMAE (Masked Autoencoders that Listen) [[code]](https://github.com/rishikksh20/AudioMAE-pytorch)

--------------------------

## Organizations
* DeepMind [[repo]](https://github.com/deepmind/deepmind-research)
* OpenAI [[repo]](https://github.com/openai)
* Clubhouse: WeeklyArxivTalk [[repo]](https://github.com/jungwoo-ha/WeeklyArxivTalk)

## Other Repositories to Refer to - Speech Included/Related
* Speech Researchers List [[repo]](https://github.com/mutiann/speech_rankings)
* Jackson-Kang [[repo]](https://github.com/Jackson-Kang/Awesome-DL-based-Text-to-speech-Papers-and-Resources)
* Rosinality's ML Papers [[repo]](https://github.com/rosinality/ml-papers)
* ivallesp's Papers [[repo]](https://github.com/ivallesp/papers)
* ddlBoJack's Speech Pretraining [[repo]](https://github.com/ddlBoJack/Awesome-Speech-Pretraining)
* fuzhenxin's Style Transfer in Text [[repo]](https://github.com/fuzhenxin/Style-Transfer-in-Text)

## Learning Materials
1. Digital Signal Processing Lecture [[link]](https://github.com/spatialaudio/digital-signal-processing-lecture)
2. Ratsgo's Speechbook [[link]](https://github.com/ratsgo/speechbook)
3. YSDA Course in Speech Processing [[code]](https://github.com/yandexdataschool/speech_course)
4. NHN Forward YouTube video [[link]](https://www.youtube.com/watch?v=UGJMnRwL-mw)