# A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection


### Table II: THE CHALLENGE COMPETITIONS PROPOSED FOR DEEPFAKE SPEECH DETECTION
| Challenge Competitions | Years | Data type | Languages | Public Label (train&dev/test) | Audio | Visual | Team No. | Top-1 System |
| :---------------------: |:------:| :---:| :--------:| :----------------------------:| :----:| :----:| :--------:| :-------------:|
| ASVspoof 2015 (audio) [[15](https://www.asvspoof.org/is2015_asvspoof.pdf)]| 2015 | Speech | English | Yes/Yes | Yes | No | 16 | Ensemble |
| ASVspoof 2019 (LA Task) [[16](https://arxiv.org/pdf/1911.01601)] | 2019 | Speech | English | Yes/Yes | Yes | No | 48 | Ensemble |
| DFDC [[17](https://arxiv.org/pdf/1910.08854)], [[18](https://arxiv.org/pdf/2006.07397)] | 2020 | Speech | English | Yes/Yes | Yes | Yes | 2114 | Ensemble |
| FTC [[19](https://www.ftc.gov/news-events/contests/ftc-voice-cloning-challenge)] | 2020 | Speech | English | No/No | Yes | No | n/a | n/a |
| ASVspoof 2021 (LA Task) [[20](https://arxiv.org/pdf/2109.00537)] | 2021 | Speech | English | Yes/Yes | Yes | No | 41 | Ensemble |
| ASVspoof 2021 (DF Task) [[20](https://arxiv.org/pdf/2109.00537)] | 2021 | Speech | English | Yes/Yes | Yes | No | 33 | Ensemble |
| ADD 2022 Track 1 [[21](http://addchallenge.cn/add2022)] | 2022 | Speech | Chinese | Yes/Yes | Yes | No | 48 | Single model |
| ADD 2022 Track 2 [[21](http://addchallenge.cn/add2022)] | 2022 | Speech | Chinese | Yes/Yes | Yes | No | 27 | Single model |
| ADD 2022 Track 3.2 [[21](http://addchallenge.cn/add2022)] | 2022 | Speech | Chinese | Yes/Yes | Yes | No | 33 | Single model |
| ADD 2023 Track 1.2 [[22](http://addchallenge.cn/add2023)] | 2023 | Speech | Chinese | No/No | Yes | No | 49 | Ensemble model |
| ADD 2023 Track 2 [[22](http://addchallenge.cn/add2023)] | 2023 | Speech | Chinese | No/No | Yes | No | 16 | Single model |
| AV-Deepfake1M [[23](https://arxiv.org/pdf/2311.15308)], [[24](https://deepfakes1m.github.io/)] | 2024 | Speech | English | Yes/No | Yes | Yes | n/a | n/a |
| ASVspoof 2024 [[25](https://www.asvspoof.org/)] | 2024 | Speech | English | Yes/No | Yes | No | 53 | Ensemble model |
| SVDD 2024 [[26](https://challenge.singfake.org/)], [[27](https://arxiv.org/abs/2408.16132)] | 2024 | Singing | Multi-language (6) | Yes/No | Yes | No | 47 | Ensemble model |

### Table III: PUBLIC AND BENCHMARK DATASETS PROPOSED FOR DEEPFAKE SPEECH DETECTION
| Dataset | Year | Language | Speakers (Male/Female) | Utt. No. (Real/Fake) | AI-Synthesized Speech Systems | Speech Condition | Real Speech Resources | Utt. length | Evaluation Metrics |
| :-----: |:-----:| :-------:| :---------------------:| :-------------------:| :----------:| :----:| :--------:| :-------------:|:-------------:|
| ASVspoof 2015 [[15](https://www.asvspoof.org/is2015_asvspoof.pdf)] (audio) | 2015 | English | 45/61 | 16,651/246,500 | 10 | Clean | Speaker Volunteers | 1 to 2 | EER |
| FoR [[30](https://bil.eecs.yorku.ca/wp-content/uploads/2020/01/FoR-Dataset_RR_VT_final.pdf)] (audio) | 2019 | English | 33 | 198,000+ | 7 | Clean | Kaggle [[31](https://www.kaggle.com/datasets/percevalw/englishfrench-translations)] | 2.35 | Acc |
| ASVspoof 2019 (LA task) [[16](https://arxiv.org/pdf/1911.01601)] (audio) | 2019 | English | 46/61 | 121,461 | 19 | Clean & Noisy | Speaker Volunteers | n/a | EER |
| DFDC [[32](https://arxiv.org/pdf/2006.07397)] (video) | 2020 | English | 3,426 | 128,154/104,500 | 1 | Clean & Noisy | Speaker Volunteers | 68.8 | Precision/Recall |
| ASVspoof 2021 (LA task) [[20](https://arxiv.org/pdf/2109.00537)] (audio) | 2021 | English | 21/27 | 18,452/163,114 | 13 | Clean & Noisy | Speaker Volunteers | n/a | EER |
| ASVspoof 2021 (DF task) [[20](https://arxiv.org/pdf/2109.00537)] (audio) | 2021 | English | 21/27 | 22,617/589,212 | 100+ | Clean & Noisy | Speaker Volunteers | n/a | EER |
| WaveFake [[33](https://arxiv.org/pdf/2111.02813)] (audio) | 2021 | English, Japanese | 0/2 | 117,985 | 6 | Clean | LJSPEECH [[29](https://arxiv.org/pdf/1802.08435)] & JSUT [[30](https://arxiv.org/pdf/1711.00354)] | 6s/4.8s | EER |
| KoDF [[36](https://arxiv.org/pdf/2103.10094)] (video) | 2021 | Korean | 198/205 | 62,116/175,776 | 2 | Clean | Speaker Volunteers | 90/15 (real/fake) | Acc & AuC |
| ADD 2022 [[21](http://addchallenge.cn/add2022)] | 2022 | Chinese | 40/40 | 3,012/24,072 | 2 | Clean | AISHELL-3 [[37](https://www.isca-archive.org/interspeech_2021/shi21c_interspeech.html)] | 1s to 10s | EER |
| FakeAVCeleb [[38](https://arxiv.org/pdf/2108.05080)] (video) | 2022 | English | 250/250 | 570/25,000 | 2 | Clean & Noisy | VoxCeleb2 [[39](https://arxiv.org/pdf/1806.05622)] | 7s | AuC |
| In-the-Wild [[40](https://arxiv.org/pdf/2203.16263)] (audio) | 2022 | English | 58 | 31,779 | 0 | Clean & Noisy | Self-collected | 4.3s | EER |
| LAV-DF [[41](https://arxiv.org/pdf/2204.06228)] (video) | 2022 | English | 153 | 36,431/99,873 | 1 | Clean & Noisy | VoxCeleb2 [[39](https://arxiv.org/pdf/1806.05622)] | 3s to 20s | AP |
| Voc.v [[42](https://arxiv.org/pdf/2210.10570)] (audio) | 2023 | English | 46/61 | 82,048 | 5 | Clean & Noisy | ASVspoof 2019 LA | n/a | EER |
| CFAD [[43](https://arxiv.org/abs/2207.12308)] (audio) | 2023 | Chinese | 1,023 | 374,000 | 12 | Clean & Noisy & Codecs | AISHELL1-3 [[44](https://arxiv.org/abs/1709.05522)], [[45](https://www.isca-archive.org/interspeech_2021/shi21c_interspeech.html)], MAGICDATA [[46](https://arxiv.org/abs/2203.16844)] | n/a | EER |
| PartialSpoof [[47](https://arxiv.org/pdf/2204.05177)] (audio) | 2023 | English | 46/61 | 12,483/108,978 | 19 | Clean & Noisy | ASVspoof 2019 | 0.2s-6.4s | EER |
| LibriSeVoc [[48](https://arxiv.org/pdf/2304.13085)] (audio) | 2023 | English | n/a | 13,201/79,206 | 6 | Clean & Noisy | Librispeech | 5s-34s | EER |
| AV-Deepfake1M [[23](https://arxiv.org/pdf/2311.15308)], [[24](https://deepfakes1m.github.io/)] (video) | 2023 | English | 2,068 | 286,721/860,039 | 2 | Clean & Noisy | VoxCeleb2 [[39](https://arxiv.org/pdf/1806.05622)] | 5s-35s | Acc & AuC |
| MLAAD [[49](https://arxiv.org/pdf/2401.09512)] (audio) | 2024 | Multi-Language ([23](https://inria.hal.science/hal-01880206/)) | n/a | 76,000 | 54 | Clean & Noisy | M-AILABS [[50](https://github.com/imdatceleste/m-ailabs-dataset)] | n/a | Acc |
| ASVspoof 2024 [[25](https://www.asvspoof.org/)] (audio) | 2024 | English | 964/958 | 188,819/815,262 | 28 | Clean & Noisy | MLS [[51](https://arxiv.org/pdf/2012.03411)] | n/a | EER |
| SVDD 2024 [[26](https://challenge.singfake.org/)] (audio) | 2024 | Multi-Language (6) | 59 | 12,169/72,235 | 48 | Clean | Mandarin & Japanese [[27](https://arxiv.org/abs/2408.16132)] | n/a | EER |

### Table IV: DEEPFAKE SPEECH GENERATION SYSTEMS USED IN PUBLIC DSD DATASETS (TTS: TEXT TO SPEECH, VC: VOICE CONVERSION, AT: ADVERSARIAL ATTACK USING MALAFIDE OR MALACOPULA)
| Datasets | Year | No. of TTS/VC/AT | Deepfake Speech Generation Systems |
| :-----: |:-----:| :-------:| :---------------------:|
| ASVspoof 2015 [[15](https://www.asvspoof.org/is2015_asvspoof.pdf)] | 2015 | 7 VC, 3 TTS | VC-01 [[52](https://ieeexplore.ieee.org/document/4218150)], [[53](https://www.isca-archive.org/interspeech_2013/wu13c_interspeech.html)], VC-02 [[54](https://ieeexplore.ieee.org/document/225953)], TTS-01 [[55](https://ieeexplore.ieee.org/document/4740153)], TTS-02 [[55](https://ieeexplore.ieee.org/document/4740153)], VC-03 [[56](http://www.festvox.org/)], VC-04 [[57](https://ieeexplore.ieee.org/document/4317579)], VC-05 [[57](https://ieeexplore.ieee.org/document/4317579)], VC-06 [[58](https://www.isca-archive.org/interspeech_2011/saito11_interspeech.pdf)], VC-07 [[59](https://ieeexplore.ieee.org/document/5995286)], TTS-03 [[60](https://github.com/marytts/marytts)] |
| FoR [[30](https://bil.eecs.yorku.ca/wp-content/uploads/2020/01/FoR-Dataset_RR_VT_final.pdf)] | 2019 | 7 TTS | Deep Voice 3, Amazon AWS Polly, Baidu TTS, Google Traditional TTS, Google Cloud TTS, Google Wavenet TTS, Microsoft Azure TTS |
| ASVspoof 2019 (LA task) [[16](https://arxiv.org/pdf/1911.01601)] | 2019 | 8 VC, 11 TTS | TTS-01 [[61](https://hts-engine.sourceforge.net/)], TTS-02 [[61](https://hts-engine.sourceforge.net/)], [[62](https://www.jstage.jst.go.jp/article/transinf/E99.D/7/E99.D_2015EDP7457/_article/-char/en)], TTS-03 [[63](https://www.isca-archive.org/ssw_2016/wu16_ssw.html)], TTS-04 [[64](https://aclanthology.org/L10-1498/)], VC-01 [[65](https://arxiv.org/abs/1610.04019)], VC-02 [[66](https://ieeexplore.ieee.org/document/1660175)], TTS-05 [[63](https://www.isca-archive.org/ssw_2016/wu16_ssw.html)], [[67](https://arxiv.org/abs/1904.02892)], TTS-06 [[61](https://hts-engine.sourceforge.net/)], [[68](https://arxiv.org/abs/1810.11946)], TTS-07 [[69](https://arxiv.org/abs/1606.06061)], [[70](https://ieeexplore.ieee.org/document/7178768)], TTS-08 [[71](https://arxiv.org/abs/1710.10467)], [[72](https://arxiv.org/abs/1802.08435)], TTS-09 [[71](https://arxiv.org/abs/1710.10467)], [[72](https://arxiv.org/abs/1802.08435)], [[73](https://ieeexplore.ieee.org/document/1164317)], TTS-10 [[74](https://arxiv.org/abs/1609.03499)], VC-03+TTS [[75](http://dws2.voicetext.jp/tomcat/demonstration/top.html)], VC-04+TTS [[76](https://www.isca-archive.org/interspeech_2018/liu18_interspeech.html)], [[77](https://www.sciencedirect.com/science/article/abs/pii/S0167639398000855)], VC-05+TTS [[76](https://www.isca-archive.org/interspeech_2018/liu18_interspeech.html)], [[77](https://www.sciencedirect.com/science/article/abs/pii/S0167639398000855)], TTS-11 [[64](https://aclanthology.org/L10-1498/)], VC-06 [[78](https://www.sciencedirect.com/science/article/pii/S0167639317303710)], [[79](https://arxiv.org/abs/1907.11898)], VC-07 [[80](https://ieeexplore.ieee.org/document/5545402)], [[81](https://www.isca-archive.org/odyssey_2012/kenny12_odyssey.html)], [[82](https://ieeexplore.ieee.org/document/4409052)], VC-08 [[66](https://ieeexplore.ieee.org/document/1660175)] |
| DFDC [[32](https://arxiv.org/pdf/2006.07397)] | 2020 | 1 TTS | TTS Skins voice conversion [[83](https://arxiv.org/abs/1904.08983)] |
| KoDF [[36](https://arxiv.org/pdf/2103.10094)] | 2021 | 2 TTS | ATFHP [[84](https://arxiv.org/abs/2002.10137)] and Wav2Lip [[85](https://arxiv.org/abs/2008.10010)] |
| ASVspoof 2021 (LA task) [[20](https://arxiv.org/pdf/2109.00537)] | 2021 | 13 TTS/VC | Reuse ASVspoof 2019 |
| ASVspoof 2021 (DF task) [[20](https://arxiv.org/pdf/2109.00537)] | 2021 | 100 TTS/VC | Vocoders [[86](https://arxiv.org/abs/2210.02437)] |
| WaveFake [[33](https://arxiv.org/pdf/2111.02813)] | 2021 | 6 TTS | MelGAN [[87](https://arxiv.org/abs/1910.06711)], FB-MelGAN [[87](https://arxiv.org/abs/1910.06711)], HiFi-GAN [[88](https://arxiv.org/abs/2010.05646)], WaveGlow [[89](https://arxiv.org/abs/1807.03039)], PWG [[90](https://arxiv.org/abs/1910.11480)], MB-MelGAN [[87](https://arxiv.org/abs/1910.06711)] |
| FakeAVCeleb [[38](https://arxiv.org/pdf/2108.05080)] | 2022 | 2 TTS | SV2TTS [[91](https://arxiv.org/abs/1806.04558)], [[92](https://arxiv.org/abs/2008.10010)] |
| In-the-Wild [[40](https://arxiv.org/pdf/2203.16263)] | 2022 | n/a | n/a |
| LAV-DF [[41](https://arxiv.org/pdf/2204.06228)] | 2022 | 1 TTS | SV2TTS [[93](https://arxiv.org/abs/1806.04558)] |
| Voc.v [[42](https://arxiv.org/pdf/2210.10570)] | 2023 | 5 TTS | HiFi-GAN [[88](https://arxiv.org/abs/2010.05646)], MB-MelGAN [[87](https://arxiv.org/abs/1910.06711)], WaveGlow [[89](https://arxiv.org/abs/1807.03039)], PWG [[90](https://arxiv.org/abs/1910.11480)], Hn-NSF [[94](https://arxiv.org/abs/1904.12088)] |
| CFAD [[43](https://arxiv.org/abs/2207.12308)] | 2023 | 11 TTS | STRAIGHT [[95](https://www.jstage.jst.go.jp/article/ast/27/6/27_6_349/_article)], Griffin-Lim [[96](https://ieeexplore.ieee.org/document/6701851)], LPCNet [[97](https://arxiv.org/abs/1810.11846)], WaveNet [[98](https://arxiv.org/abs/1609.03499)], PWG [[90](https://arxiv.org/abs/1910.11480)], HiFi-GAN [[99](https://arxiv.org/abs/2010.05646)], MB-MelGAN [[87](https://arxiv.org/abs/1910.06711)], MelGAN [[87](https://arxiv.org/abs/1910.06711)], WORLD [[100](https://www.jstage.jst.go.jp/article/transinf/E99.D/7/E99.D_2015EDP7457/_article/-char/en)], FastSpeech [[101](https://arxiv.org/abs/2006.04558)], Tacotron-HifiGAN [[102](https://arxiv.org/abs/1703.10135)] |
| PartialSpoof [[47](https://arxiv.org/pdf/2204.05177)] | 2023 | 21 TTS/VC | Reuse ASVspoof 2019 |
| LibriSeVoc [[48](https://arxiv.org/pdf/2304.13085)] | 2023 | 6 TTS/VC | WaveNet [[98](https://arxiv.org/abs/1609.03499)], WaveRNN [[103](https://arxiv.org/abs/1802.08435)], MelGAN [[87](https://arxiv.org/abs/1910.06711)], Parallel WaveGAN [[104](https://arxiv.org/abs/1910.11480)], WaveGrad [[105](https://arxiv.org/abs/2009.00713)], DiffWave [[106](https://arxiv.org/abs/2009.09761)] |
| AV-Deepfake1M [[23](https://arxiv.org/pdf/2311.15308)], [[24](https://deepfakes1m.github.io/)] | 2023 | 2 TTS | VITS [[107](https://arxiv.org/abs/2106.06103)], YourTTS [[108](https://arxiv.org/abs/2112.02418)] |
| MLAAD [[49](https://arxiv.org/pdf/2401.09512)] | 2024 | 54 TTS | Bark, Capacitron, FastPitch, GlowTTS, Griffin Lim, Jenny, NeuralHMM, Overflow, Parler TTS, SpeechT5, Tacotron DDC, Tacotron2, Tacotron2 DCA, Tacotron2 DH, Tacotron2-DDC, Tortoise, VITS, VITS Neon, VITS-MMS, XTTS v1.1, XTTS v2 |
| ASVspoof 2024 [[25](https://www.asvspoof.org/)] | 2024 | 15 TTS, 6 VC, 7 AT | TTS-01 [[109](https://arxiv.org/abs/2005.11129)], TTS-02 [[110](https://arxiv.org/abs/2105.06337)], TTS-03 [[111](https://arxiv.org/abs/2006.06873)], TTS-04 [[112](https://arxiv.org/abs/2106.06103)], TTS-05 [[113](https://arxiv.org/abs/2210.12223)], TTS-06 [[114](https://arxiv.org/abs/2010.05646)], TTS-07 [[115](https://arxiv.org/abs/1712.05884)], TTS-08 (self-developed), VC-01 [[116](https://arxiv.org/abs/2107.10394)], TTS-09 [[117](https://arxiv.org/abs/2112.02418)], VC-02 [[118](https://paperswithcode.com/paper/voice-conversion-using-speech-to-speech-neuro)], VC-03 (self-developed), TTS-10 [[119](https://arxiv.org/abs/2312.14398)], AT-01 (Malafide+TTS-10 [[119](https://arxiv.org/abs/2312.14398)]), TTS-11 [[120](https://arxiv.org/abs/1712.04787)], AT-02 (self-developed), TTS-12 [[121](https://arxiv.org/abs/2206.04658)], TTS-13 [[122](https://arxiv.org/abs/2206.12229)], AT-03 (Malafide+TTS [[123](https://arxiv.org/abs/2210.12223)]), VC-04 (self-developed), VC-05 [[124](https://arxiv.org/abs/2109.13821)], VC-06 (add noise), AT-04 (Malacopula+VC-06), TTS-14 [[125](https://arxiv.org/abs/2112.02418)], TTS-15 [[126](https://arxiv.org/abs/2406.04904)], AT-05 (Malacopula+AT-01), AT-06 (Malacopula+TTS-13 [[122](https://arxiv.org/abs/2206.12229)]), AT-07 (Malacopula+VC-05 [[124](https://arxiv.org/abs/2109.13821)]) |