> [!IMPORTANT]
>
> A curated collection of papers and resources on Audio Deepfake Detection (ADD).
>
> Please refer to our survey [**"Research progress on speech deepfake and its detection techniques"**](https://www.cjig.cn/en/article/doi/10.11834/jig.230476/?viewType=HTML) for the detailed contents. [![Paper page](https://huggingface.co/datasets/huggingface/badges/raw/main/paper-page-sm-dark.svg)](https://www.cjig.cn/en/article/doi/10.11834/jig.230476/?viewType=HTML)
>
> Please let us know if you discover any mistakes or have suggestions by emailing us: xuyuxiong2022@email.szu.edu.cn

# Table of contents

- [Table of contents](#table-of-contents)
- [What's New](#whats-new)
- [Survey](#survey)
- [Top Repositories](#top-repositories)
- [Audio Large Model](#audio-large-model)
- [Datasets](#datasets)
- [Audio Preprocessing](#audio-preprocessing)
  - [Commonly Used Noise Datasets](#commonly-used-noise-datasets)
  - [Audio Enhancement Methods](#audio-enhancement-methods)
- [Feature Extraction](#feature-extraction)
  - [Handcrafted Feature-based Forgery Detection](#handcrafted-feature-based-forgery-detection)
  - [Hybrid Feature-based Forgery Detection](#hybrid-feature-based-forgery-detection)
  - [End-to-end Forgery Detection](#end-to-end-forgery-detection)
  - [Feature Fusion-based Forgery Detection](#feature-fusion-based-forgery-detection)
- [Network Training](#network-training)
  - [Multi-task Learning-based Forgery Detection](#multi-task-learning-based-forgery-detection)
- [Reference](#reference)
- [Statement](#statement)

# What's New

- [Update Jun. 3, 2025] 🔥🔥🔥 Updated [Datasets](#datasets), [Survey](#survey), and [Top Repositories](#top-repositories).
- [Update Feb. 20, 2025] Updated [Datasets](#datasets); added [Survey](#survey) and [Top Repositories](#top-repositories).

# Survey
[**⬆ Back to top**](#table-of-contents)
- `【2022-04】-【ADD-Survey】-【Ben Gurion University】`
  - **A study on data augmentation in voice anti-spoofing**
  - **Author(s):** Ariel Cohen, Inbal Rimon, Eran Aflalo, Haim H. Permuter
  - [Paper](https://doi.org/10.1016/j.specom.2022.04.005) [Code](https://github.com/InbalRim/A-Study-On-Data-Augmentation-In-Voice-Anti-Spoofing)
- `【2022-05】-【ADD-Survey】-【King Saud University】`
  - **A review of modern audio deepfake detection methods: challenges and future directions**
  - **Author(s):** Zaynab Almutairi and Hebah Elgibreen
  - [Paper](https://doi.org/10.3390/a15050155)
- `【2023-01】-【ADD-Survey】-【University of Maryland Baltimore County】`
  - **Audio deepfakes: A survey**
  - **Author(s):** Zahra Khanjani, Gabrielle Watson, and Vandana P. Janeja
  - [Paper](https://doi.org/10.3389/fdata.2022.1001063)
- `【2023-02】-【ADD-Survey】-【Oakland University】`
  - **Battling voice spoofing: a review, comparative analysis, and generalizability evaluation of state-of-the-art voice spoofing counter measures**
  - **Author(s):** Awais Khan, Khalid Mahmood Malik, James Ryan, and Mikul Saravanan
  - [Paper](https://doi.org/10.1007/s10462-023-10539-8) [Code](https://github.com/smileslab/Comparative-Analysis-Voice-Spoofing)
- `【2023-04】-【ADD-Survey】-【Panjab University】`
  - **Review of audio deepfake detection techniques: Issues and prospects**
  - **Author(s):** Abhishek Dixit, Nirmal Kaur, Staffy Kingra
  - [Paper](https://doi.org/10.1111/exsy.13322)
- `【2023-07】-【ADD-Survey】-【Indian Institute of Technology】`
  - **Uncovering the deceptions: an analysis on audio spoofing detection and future prospects**
  - **Author(s):** Rishabh Ranjan, Mayank Vatsa, Richa Singh
  - [Paper](https://doi.org/10.24963/ijcai.2023/756)
- `【2023-08】-【ADD-Survey】-【CASIA】`
  - **Audio deepfake detection: a survey**
  - **Author(s):** Jiangyan Yi, Chenglong Wang, Jianhua Tao, Xiaohui Zhang, Chu Yuan Zhang, and Yan Zhao
  - [Paper](https://arxiv.org/abs/2308.14970)
- `【2024-02】-【Multimodal-Survey】-【Purdue University, Nanchang University】`
  - **Detecting Multimedia Generated by Large AI Models: A Survey**
  - **Author(s):** Li Lin, Neeraj Gupta, Yue Zhang, Hainan Ren, Chun-Hao Liu, Feng Ding, Xin Wang, Xin Li, Luisa Verdoliva, Shu Hu
  - [Paper](https://arxiv.org/abs/2402.00045) [Code](https://github.com/Purdue-M2/Detect-LAIM-generated-Multimedia-Survey)
- `【2024-04】-【ADD-Survey】-【Toronto Metropolitan University】`
  - **A Survey on Speech Deepfake Detection**
  - **Author(s):** Menglu Li, Yasaman Ahmadiadli, Xiao-Ping Zhang
  - [Paper](https://doi.org/10.1145/3714458)
- `【2024-05】-【Multimodal-Survey】-【Cool Large Language Models Research Group, Renmin University of China】`
  - **Fake Artificial Intelligence Generated Contents (FAIGC): A Survey of Theories, Detection Methods, and Opportunities**
  - **Author(s):** Xiaomin Yu, Yezhaohui Wang, Yanfang Chen, Zhen Tao, Dinghao Xi, Shichao Song, and Simin Niu
  - [Paper](https://arxiv.org/abs/2405.00711)
- `【2024-09】-【ADD-Survey】-【Austrian Institute of Technology】`
  - **A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection**
  - **Author(s):** Lam Pham, Phat Lam, Tin Nguyen, Hieu Tang, Huyen Nguyen, Alexander Schindler, Hai Canh Vu
  - [Paper](https://doi.org/10.1016/j.cosrev.2025.100757) [Code](https://github.com/AI-ResearchGroup/A-Comprehensive-Survey-with-Critical-Analysis-for-Deepfake-Speech-Detection)
- `【2024-11】-【Multimodal-Survey】-【University of Bucharest】`
  - **Deepfake Media Generation and Detection in the Generative AI Era: A Survey and Outlook**
  - **Author(s):** Florinel-Alin Croitoru, Andrei-Iulian Hîji, Vlad Hondru, Nicolae-Cătălin Ristea, Paul Irofti, Marius Popescu, Cristian Rusu, Radu Tudor Ionescu, Fahad Shahbaz Khan, and Mubarak Shah
  - [Paper](https://arxiv.org/abs/2411.19537) [Code](https://github.com/CroitoruAlin/biodeep)
- `【2024-11】-【Multimodal-Survey】-【University College Dublin】`
  - **Passive Deepfake Detection Across Multi-modalities: A Comprehensive Survey**
  - **Author(s):** Hong-Hanh Nguyen-Le, Van-Tuan Tran, Dinh-Thuc Nguyen, Nhien-An Le-Khac
  - [Paper](https://arxiv.org/abs/2411.17911)
- `【2024-12】-【ADD-Survey】-【Imperial College London, Technical University of Munich】`
  - **From Audio Deepfake Detection to AI-Generated Music Detection - A Pathway and Overview**
  - **Author(s):** Yupei Li, Manuel Milling, Lucia Specia, Björn W. Schuller
  - [Paper](https://arxiv.org/abs/2412.00571)
- `【2025-02】-【Multimodal-Survey】-【BUPT, University of California, CASIA】`
  - **Survey on AI-Generated Media Detection: From Non-MLLM to MLLM**
  - **Author(s):** Yueying Zou, Peipei Li, Zekun Li, Huaibo Huang, Xing Cui, Xuannan Liu, Chenghanyu Zhang, Ran He
  - [Paper](https://arxiv.org/abs/2502.05240)

# Top Repositories
[**⬆ Back to top**](#table-of-contents)
- **Audio Large Language Models**
  - Resources on Audio Large Language Models, including datasets, methods, benchmarks, and studies.
  - [Repo](https://github.com/AudioLLMs/Awesome-Audio-LLM)
- **awesome-fake-audio-detection**
  - A list of tools, papers, and code related to fake audio detection.
  - [Repo](https://github.com/john852517791/awesome-fake-audio-detection)
- **ASVspoof Challenge Official Repository**
  - Sub-repositories include [ASVspoof 5](https://github.com/asvspoof-challenge/asvspoof5), [ASVspoof 2021](https://github.com/asvspoof-challenge/2021), and [classifier-adjacency](https://github.com/asvspoof-challenge/classifier-adjacency).
  - [Repo](https://github.com/asvspoof-challenge)

# Audio Large Model
[**⬆ Back to top**](#table-of-contents)

| Model | Publisher | Release | Achievable Tasks |
|:----:|:-------------:|:--------------:|:--------------------|
| AudioLM<br>[Paper](https://arxiv.org/pdf/2209.03143) [Website](https://google-research.github.io/seanet/audiolm/examples/) [Code](https://github.com/lucidrains/audiolm-pytorch) | Google | 2022.09 | 1. Maps input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space.<br>2. Speech continuation, acoustic generation, unconditional generation, generation without semantic tokens, and piano continuation. |
| VALL-E<br>[Paper](https://arxiv.org/abs/2301.02111) [Website](https://www.microsoft.com/en-us/research/project/vall-e-x/) | Microsoft | 2023.01 | 1. Creates high-quality personalised speech from only a 3-second enrollment recording of an unseen speaker.<br>2. VALL-E X: cross-lingual speech synthesis. |
| USM<br>[Website](https://sites.research.google/usm/) | Google | 2023.03 | 1. ASR for more than 100 languages.<br>2. Downstream ASR tasks.<br>3. Automatic Speech Translation (AST). |
| SpeechGPT<br>[Website](https://github.com/0nutation/SpeechGPT) | Fudan University | 2023.05 | 1. Perceives and generates multi-modal content.<br>2. Spoken-dialogue LLM with strong instruction-following ability. |
| Pengi<br>[Paper](https://arxiv.org/pdf/2305.11834.pdf) [Website](https://github.com/microsoft/Pengi) | Microsoft | 2023.05 | 1. An audio language model that leverages transfer learning by framing all audio tasks as text-generation tasks.<br>2. Its unified architecture enables both open-ended and close-ended tasks without additional fine-tuning or task-specific extensions. |
| VoiceBox<br>[Website](https://voicebox.metademolab.com/) | Meta | 2023.06 | 1. Synthesizes speech in six languages.<br>2. Removes transient noise.<br>3. Edits content.<br>4. Transfers audio style within and across languages.<br>5. Generates diverse speech samples. |
| AudioPaLM<br>[Paper](https://arxiv.org/pdf/2306.12925) [Website](https://google-research.github.io/seanet/audiopalm/examples) | Google | 2023.06 | 1. Speech-to-speech translation.<br>2. Automatic Speech Recognition (ASR). |
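The AudioLM row above describes the recipe most of these models share: compress audio into discrete tokens, then model the token sequence autoregressively. As a rough, self-contained illustration of that second stage only (a toy model with placeholder vocabulary and sizes, with random integers standing in for real codec tokens; this is not any published implementation):

```python
import torch
import torch.nn as nn

VOCAB, DIM = 1024, 256  # toy token vocabulary and model width

class TinyAudioLM(nn.Module):
    """Autoregressive LM over discrete audio tokens (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):  # tokens: (batch, time)
        T = tokens.size(1)
        # Causal mask: each position may only attend to earlier positions
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.backbone(self.embed(tokens), mask=causal)
        return self.head(h)     # (batch, time, VOCAB) next-token logits

model = TinyAudioLM()
tokens = torch.randint(0, VOCAB, (2, 128))  # stand-ins for codec tokens
logits = model(tokens[:, :-1])              # predict token t+1 from the prefix
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB),
                                   tokens[:, 1:].reshape(-1))
loss.backward()
```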
# Datasets
[**⬆ Back to top**](#table-of-contents)

| Attack Types | Year | Dataset | Number of Audio Clips<br>(Subset: Real/Fake) | Language |
|:-----------:|:------------:|:------------:|:-------------:|:------------:|
|TTS|2019|FoR<br>[Paper](https://ieeexplore.ieee.org/document/8906599) [Dataset](http://bil.eecs.yorku.ca/datasets)|111,000/87,000|English|
|TTS|2021|WaveFake<br>[Paper](https://arxiv.org/abs/2111.02813) [Dataset](https://zenodo.org/record/5642694)|16,283/117,985|English, Japanese|
|TTS|2021|HAD (Half-truth)<br>[Paper](https://arxiv.org/abs/2104.03617)|53,612/107,224|Chinese|
|TTS|2021|FakeAVCeleb<br>[Paper](https://arxiv.org/pdf/2108.05080) [Dataset](https://github.com/DASH-Lab/FakeAVCeleb/tree/main)|10,209/11,335|English|
|TTS|2022|ADD 2022<br>[Paper](https://arxiv.org/pdf/2202.08433v2.pdf)|LF: 5,619/46,067<br>PF: 5,319/46,419<br>FG-D: 5,319/46,206|Chinese|
|TTS|2022|CMFD<br>[Paper](https://ieeexplore.ieee.org/document/9921343) [Dataset](https://github.com/WuQinfang/CMFD)|Chinese: 1,800/1,000<br>English: 1,800/1,000|English, Chinese|
|TTS|2022|In-the-Wild<br>[Paper](https://arxiv.org/abs/2203.16263) [Dataset](https://deepfake-demo.aisec.fraunhofer.de/in_the_wild)|19,963/11,816|English|
|TTS|2022|CFAD<br>[Paper](https://arxiv.org/abs/2207.12308) [Dataset](https://zenodo.org/record/8122764)|38,600/77,200|Chinese|
|TTS|2022|Psynd<br>[Paper](https://ieeexplore.ieee.org/document/9956134) [Dataset](https://scholarbank.nus.edu.sg/handle/10635/227398)|2,294|English|
|TTS|2022|TIMIT-TTS<br>[Paper](https://arxiv.org/abs/2209.08000) [Dataset](https://zenodo.org/records/6560159)|0/79,120|English|
|TTS|2023|ODSS<br>[Paper](https://ieeexplore.ieee.org/abstract/document/10374863) [Dataset](https://zenodo.org/records/8370669)|11,032/18,993|English, German, Spanish|
|TTS|2024|MLAAD<br>[Paper](https://arxiv.org/abs/2401.09512) [Dataset](https://deepfake-total.com/mlaad)|-/76,000|Multi-lingual|
|TTS|2024|CD-ADD<br>[Paper](https://arxiv.org/abs/2404.04904)|300 hours|English|
|TTS|2024|DiffSSD<br>[Paper](https://arxiv.org/abs/2409.13049) [Dataset](https://huggingface.co/datasets/purdueviperlab/diffssd)|24,226/70,000|English|
|TTS|2024|ACCENT<br>[Paper](https://arxiv.org/abs/2409.08346)|53,651/192,461|Multi-lingual|
|TTS|2024|SpoofCeleb<br>[Paper](https://arxiv.org/abs/2409.17285) [Dataset](http://www.jungjee.com/spoofceleb/)|250k+|English|
|TTS|2024|LlamaPartialSpoof<br>[Paper](https://arxiv.org/abs/2409.14743) [Dataset](https://github.com/hieuthi/LlamaPartialSpoof)|10,573/33,479|English|
|TTS|2024|FakeSound<br>[Paper](https://arxiv.org/abs/2406.08052) [Dataset](https://github.com/FakeSoundData/FakeSound)|-/3,798|English|
|TTS|2024|DFADD<br>[Paper](https://arxiv.org/abs/2409.08731) [Dataset](https://github.com/isjwdu/DFADD)|44,455/163,500|English|
|TTS|2024|SONAR<br>[Paper](https://arxiv.org/abs/2410.04324) [Dataset](https://github.com/Jessegator/SONAR)|-/2,274|Multi-lingual|
|TTS|2025|6KSFx<br>[Paper](https://arxiv.org/abs/2501.17198) [Dataset](https://github.com/nellyngz95/6KSFX)|-/6,000|-|
|TTS|2025|BanglaFake<br>[Paper](https://arxiv.org/abs/2505.10885) [Dataset](https://huggingface.co/datasets/sifat1221/banglaFake)|12,260/13,260|Bengali|
|Replay|2017|ASVspoof 2017<br>[Paper](https://www.asvspoof.org/asvspoof2017overview_cameraReady.pdf) [Dataset](https://datashare.ed.ac.uk/handle/10283/3055)|3,565/14,465|English|
|Replay|2019|ReMASC<br>[Paper](https://arxiv.org/abs/1904.03365) [Dataset](https://github.com/YuanGongND/ReMASC)|9,240/45,472|English, Chinese, Hindi|
|Replay|2019|VSDC<br>[Paper](https://arxiv.org/abs/1909.00935) [Dataset](http://www.secs.oakland.edu/~mahmood/datasets/audiospoof)|1,687/11,772|English|
|Replay|2024|POLIPHONE<br>[Paper](https://arxiv.org/abs/2410.06221) [Dataset](https://zenodo.org/records/13903412)|41,941|English|
|TTS and VC|2015|AVspoof<br>[Paper](https://ieeexplore.ieee.org/document/7358783) [Dataset](https://zenodo.org/record/4081040)|LA: 15,504/120,480<br>PA: 15,504/14,465|English|
|TTS and VC|2015|ASVspoof 2015<br>[Paper](https://datashare.ed.ac.uk/bitstream/handle/10283/853/ASVspoof2015_evaluation_plan.pdf?sequence=2&isAllowed=y) [Dataset](http://dx.doi.org/10.7488/ds/298)|16,651/246,500|English|
|TTS and VC|2021|FMFCC-A<br>[Paper](https://arxiv.org/abs/2110.09441) [Dataset](https://github.com/Amforever/FMFCC-A)|10,000/40,000|Chinese|
|TTS and VC|2022|SceneFake<br>[Paper](https://arxiv.org/abs/2211.06073) [Dataset](https://zenodo.org/record/7663324#.Y_XKMuPYuUk)|19,838/64,642|English|
|TTS and VC|2022|EmoFake<br>[Paper](http://arxiv.org/abs/2211.05363)|35,000/53,200|English, Chinese|
|TTS and VC|2023|PartialSpoof<br>[Paper](https://arxiv.org/abs/2104.02518) [Dataset](https://zenodo.org/record/4817532#.YLd8Yi2l1hF)|12,483/108,978|English|
|TTS and VC|2023|ADD 2023<br>[Paper](https://arxiv.org/abs/2305.13774)|FG-D: 172,819/113,042<br>RL: 55,468/65,449<br>AR: 14,907/95,383|Chinese|
|TTS and VC|2023|DECRO<br>[Paper](https://dl.acm.org/doi/10.1145/3543507.3583222) [Dataset](https://github.com/petrichorwq/DECRO-dataset)|Chinese: 21,218/41,880<br>English: 12,484/42,799|English, Chinese|
|TTS and VC|2023|HABLA<br>[Paper](https://doi.org/10.21437/Interspeech.2023-2272) [Dataset](https://github.com/Ruframapi/HABLA)|22,000/58,000|Spanish|
|TTS and VC|2024|DeepFakeVox-HQ<br>[Paper](https://ieeexplore.ieee.org/document/10800535)|693k/643k|English|
|TTS and VC|2024|VoiceWukong<br>[Paper](https://arxiv.org/abs/2409.06348) [Dataset](https://voicewukong.github.io)|5,300/413,400|English, Chinese|
|TTS and VC|2024|Speech-Forensics<br>[Paper](https://www.ijcai.org/proceedings/2024/0046) [Dataset](https://github.com/ring-zl/Speech-Forensics)|13,100/7,452|English|
|TTS and VC|2024|VoiceEdit<br>[Paper](https://arxiv.org/abs/2402.06304)|-|Multi-lingual|
|TTS and VC|2024|RFP<br>[Paper](https://arxiv.org/abs/2404.17721) [Dataset](https://zenodo.org/records/10202142)|28,115/74,199|English|
|TTS and VC|2025|MADD<br>[Paper](https://ieeexplore.ieee.org/document/10800535)|60,000/129,990|Multi-lingual|
|TTS and VC|2025|XMAD-Bench<br>[Paper](https://arxiv.org/pdf/2506.00462) [Dataset](https://github.com/ristea/xmad-bench/)|414,858|Multi-lingual|
|TTS and Vocoder|2024|Diffuse or Confuse<br>[Paper](https://arxiv.org/abs/2410.06796) [Dataset](https://github.com/AntonFirc/diffusion-deepfake-speech-dataset/)|131,000/183,400|English|
|TTS and Vocoder|2025|ShiftySpeech<br>[Paper](https://arxiv.org/abs/2502.05674) [Dataset](https://huggingface.co/datasets/ash56/ShiftySpeech)|3,000+ hours|English, Chinese, Japanese|
|TTS, VC and Replay|2019|ASVspoof 2019<br>[Paper](https://arxiv.org/abs/1904.05441) [Dataset](https://datashare.ed.ac.uk/handle/10283/3336)|LA: 12,483/108,978<br>PA: 28,890/189,540|English|
|TTS, VC and Replay|2021|ASVspoof 2021<br>[Paper](https://arxiv.org/abs/2109.00535) [Dataset](https://www.asvspoof.org/index2021.html)|LA: 18,452/163,114<br>PA: 126,630/816,480<br>DF: 14,869/519,059|English|
|TTS, VC and Vocoder|2024|SpeechFake<br>[Paper](https://openreview.net/forum?id=GpUO6qYNQG)|3,000+ hours|English|
|VC, Replay and Adversarial|2024|VSASV<br>[Paper](https://www.isca-archive.org/interspeech_2024/hoang24b_interspeech.html) [Dataset](https://github.com/hustep-lab/VSASV-Dataset)|164,000/174,000|Multi-lingual|
|TTS, VC and Adversarial|2024|ASVspoof 5<br>[Paper](https://arxiv.org/abs/2502.08857) [Dataset](https://zenodo.org/records/14498691)|188,819/815,262|English|
|Voice Cloning|2021|RTVCSpoof<br>[Paper](https://ojs.aaai.org/index.php/AAAI/article/view/6044/5900)|3,284/4,843|English|
|Voice Cloning|2024|Kratika Dataset<br>[Paper](https://doi.org/10.1145/3658664.3659658)|24,226/25,000|English|
|Vocoder|2022|Yan Dataset<br>[Paper](https://doi.org/10.1145/3552466.3556525)|8,200/63,200|Chinese|
|Vocoder|2023|LibriSeVoc<br>[Paper](https://arxiv.org/abs/2304.13085) [Dataset](https://github.com/csun22/SyntheticVoice-Detection-Vocoder-Artifacts)|13,201/79,206|English|
|Vocoder|2023|Voc.v1-v4<br>[Paper](https://arxiv.org/abs/2210.10570) [Dataset](https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/09-asvspoof-vocoded-trn)|2,580/10,320|English|
|Vocoder|2024|MLADDC<br>[Paper](https://openreview.net/forum?id=ic3HvoOTeU) [Dataset](https://speech007.github.io/MLADDC_Nips/)|80k/160k|Multi-lingual|
|Vocoder|2024|CVoiceFake<br>[Paper](https://dl.acm.org/doi/10.1145/3658644.3670285) [Dataset](https://safeearweb.github.io/Project/)|23,544/91,700|Multi-lingual|
|Impersonation|2024|IPAD<br>[Paper](https://arxiv.org/abs/2408.17009)|5,170/18,874|Chinese|
|Text-To-Music (TTM)|2024|FSD<br>[Paper](https://arxiv.org/abs/2309.02232) [Dataset](https://github.com/xieyuankun/FSD-Dataset)|200/500|Chinese|
|Text-To-Music (TTM)|2024|SingFake<br>[Paper](https://arxiv.org/abs/2309.07525) [Dataset](https://github.com/yongyizang/SingFake)|634/671|Multi-lingual|
|Text-To-Music (TTM)|2024|CtrSVDD<br>[Paper](https://arxiv.org/abs/2406.02438) [Dataset](https://github.com/SVDDChallenge/)|32,312/188,486|Multi-lingual|
|Text-To-Music (TTM)|2024|FakeMusicCaps<br>[Paper](https://arxiv.org/abs/2409.10684) [Dataset](https://zenodo.org/records/13732524)|5.5k/27,605|English|
|Text-To-Music (TTM)|2025|SONICS<br>[Paper](https://arxiv.org/abs/2408.14080)|48,090/49,074|English|
|Text-To-Music (TTM)|2025|SingNet<br>[Paper](https://arxiv.org/pdf/2505.09325) [Dataset](https://singnet-dataset.github.io/)|2,963.4 hours|Multi-lingual|
|Codec-based Speech Generation (CoSG)|2024|CodecFake<br>[Paper](https://arxiv.org/abs/2406.07237) [Dataset](https://codecfake.github.io/)|42,752/45,045|English|
|Codec-based Speech Generation (CoSG)|2024|ALM-ADD<br>[Paper](https://arxiv.org/abs/2408.10853) [Dataset](https://github.com/xieyuankun/ALM-ADD)|123/210|English|
|Codec-based Speech Generation (CoSG)|2024|Codecfake<br>[Paper](https://ieeexplore.ieee.org/abstract/document/10830534) [Dataset](https://zenodo.org/records/13838106)|132,277/925,939|English, Chinese|
|Codec-based Speech Generation (CoSG)|2025|ST-Codecfake<br>[Paper](https://arxiv.org/pdf/2501.06514) [Dataset](https://zenodo.org/records/14631091)|13,228/145,778|English, Chinese|
|Codec-based Speech Generation (CoSG)|2025|CodecFake+<br>[Paper](https://arxiv.org/abs/2501.08238)|90,163/1,423,894|English|

# Audio Preprocessing

## Commonly Used Noise Datasets
[**⬆ Back to top**](#table-of-contents)

| Dataset | Description |
|:---:|:---|
| MUSAN<br>[Dataset](https://www.openslr.org/17/) | A corpus of music, speech, and noise. |
| RIR<br>[Dataset](https://www.openslr.org/28/) | A database of simulated and real room impulse responses, plus isotropic and point-source noises. All audio files are sampled at 16 kHz with 16-bit precision. |
| NOIZEUS<br>[Dataset](https://ecs.utdallas.edu/loizou/speech/noizeus/) | Contains 30 IEEE sentences (produced by three male and three female speakers) corrupted by eight real-world noises at different SNRs: suburban train, babble, car, exhibition hall, restaurant, street, airport, and train station. |
| NoiseX-92<br>[Dataset](http://spib.linse.ufsc.br/noise.html) | Fifteen noise types, each 235 seconds long, sampled at 19.98 kHz through a 16-bit A/D converter with an anti-aliasing filter and no pre-emphasis stage. |
| DEMAND<br>[Dataset](https://zenodo.org/record/1227121#.YXtsyPlBxjU) | A multi-channel acoustic noise database covering diverse environments. |
| ESC-50<br>[Dataset](https://github.com/karolpiczak/ESC-50) | A labelled collection of 2,000 environmental audio clips from Freesound.org, suitable for environmental sound classification. It consists of 5-second recordings organised into 5 broad categories, each with 10 subcategories (40 examples per subcategory). |
| ESC<br>[Dataset](http://shujujishi.com/dataset/69b2bf03-d855-4f8b-ab96-1ec80e285863.html) | Includes ESC-50, ESC-10, and ESC-US. |
| FSD50K<br>[Dataset](https://zenodo.org/record/4060432#.Y1kvcHZByUk) | An open dataset of human-labelled sound events: 51,197 Freesound clips totalling 108.3 hours of multi-labelled audio, unequally distributed across 200 classes from the AudioSet Ontology. |
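These corpora are typically mixed into the training speech at a controlled signal-to-noise ratio (for instance, the 15 dB / 25 dB noise augmentation that appears in the detection tables below). A minimal sketch of SNR-controlled additive-noise augmentation; the synthetic signals here are placeholders for clips you would load from MUSAN, DEMAND, etc. with a library such as soundfile:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech` so that the mixture has the requested SNR in dB."""
    if len(noise) < len(speech):                  # loop the noise if too short
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale noise power so that 10*log10(p_speech / p_noise_scaled) == snr_db
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # stand-in speech
noise = rng.standard_normal(8000)                            # stand-in noise
augmented = mix_at_snr(speech, noise, snr_db=15.0)
```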
## Audio Enhancement Methods
[**⬆ Back to top**](#table-of-contents)

| Method | Description |
|:---:|:---|
| SpecAugment<br>[Paper](https://arxiv.org/pdf/1904.08779.pdf) [Code](https://github.com/DemisEom/SpecAugment) | Augmentation strategies include time warping, frequency masking, and time masking. |
| WavAugment<br>[Paper](https://arxiv.org/abs/2007.00991) [Code](https://github.com/facebookresearch/WavAugment) | Augmentation strategies include pitch randomization, reverberation, additive noise, time dropout (temporal masking), band reject, and clipping. |
| RawBoost<br>[Paper](https://arxiv.org/abs/2111.04433) [Code](https://github.com/TakHemlata/RawBoost-antispoofing) | Augmentation strategies include linear and non-linear convolutive noise, impulsive signal-dependent additive noise, and stationary signal-independent additive noise. |
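For intuition, here is a minimal SpecAugment-style sketch that applies one random frequency mask and one random time mask to a spectrogram. It is not the reference implementation linked above, and time warping is omitted for brevity:

```python
import numpy as np

def spec_masks(spec: np.ndarray, max_f: int = 8, max_t: int = 20, seed=None) -> np.ndarray:
    """Zero out one random frequency band and one random time span.

    `spec` has shape (n_freq_bins, n_frames).
    """
    rng = np.random.default_rng(seed)
    out = spec.copy()
    n_freq, n_time = out.shape
    f = rng.integers(0, max_f + 1)           # frequency mask width
    f0 = rng.integers(0, n_freq - f + 1)     # mask start bin
    out[f0:f0 + f, :] = 0.0
    t = rng.integers(0, max_t + 1)           # time mask width
    t0 = rng.integers(0, n_time - t + 1)     # mask start frame
    out[:, t0:t0 + t] = 0.0
    return out

spec = np.abs(np.random.randn(80, 300))      # stand-in log-mel magnitudes
augmented = spec_masks(spec, seed=0)
```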
# Feature Extraction

## Handcrafted Feature-based Forgery Detection
[**⬆ Back to top**](#table-of-contents)

| Paper | Data Augmentation | Feature Extraction | Network Framework | Loss Function | EER (%) | t-DCF |
|:----|:----:|:----:|:----:|:----:|:----:|:----:|
| Detecting spoofing attacks using VGG and SincNet: BUT-Omilia submission to ASVspoof 2019 challenge<br>Paper Code | — | CQT, Power Spectrum | VGG, SincNet | CE | LA: 8.01 (4)<br>PA: 1.51 (2) | LA: 0.208 (4)<br>**PA: 0.037 (1)** |
| Long-term high frequency features for synthetic speech detection<br>Paper | Cafe, White and Street Noise | ICQC, ICQCC, ICBC, ICLBC | DNN | CE | LA: 7.78 (3) | LA: 0.187 (3) |
| Voice spoofing countermeasure for logical access attacks detection<br>Paper | — | ELTP-LFCC | DBiLSTM | — | **LA: 0.74 (1)** | **LA: 0.008 (1)** |
| Voice spoofing detector: A unified anti-spoofing framework<br>Paper | — | ATP-GTCC | SVM | Hamming Distance | LA: 0.75 (2)<br>**PA: 1.00 (1)** | LA: 0.050 (2)<br>PA: 0.064 (2) |

Note: "—" indicates not mentioned in the paper. Values in brackets in the experimental results are the ranking of each column in the LA or PA scenario, and bolded values are the best results for that scenario.
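The EER column used throughout these tables is the operating point at which the false-acceptance rate (a spoof accepted) equals the false-rejection rate (a bona fide utterance rejected). A small self-contained sketch with synthetic scores; official ASVspoof evaluations use the organizers' scoring scripts, which also compute the (min) t-DCF:

```python
import numpy as np

def compute_eer(bonafide_scores: np.ndarray, spoof_scores: np.ndarray) -> float:
    """Equal Error Rate: threshold where FAR (spoof accepted) == FRR (bona fide rejected)."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))          # closest crossing point
    return float((far[idx] + frr[idx]) / 2)

rng = np.random.default_rng(0)
bonafide = rng.normal(2.0, 1.0, 1000)   # higher score = more likely bona fide
spoof = rng.normal(-2.0, 1.0, 1000)
print(f"EER: {100 * compute_eer(bonafide, spoof):.2f}%")
```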
## Hybrid Feature-based Forgery Detection
[**⬆ Back to top**](#table-of-contents)

| Paper | Data Augmentation | Feature Extraction | Network Structure | Loss Function | EER (%) | t-DCF |
|:----|:----:|:----:|:----:|:----:|:----:|:----:|
| Light convolutional neural network with feature genuinization for detection of synthetic speech attacks<br>Paper | — | CQT-based LPS | LCNN | — | LA: 4.07 (11) | LA: 0.102 (10) |
| Siamese convolutional neural network using gaussian probability feature for spoofing speech detection<br>Paper | — | LFCC | Siamese CNN | CE | LA: 3.79 (10)<br>PA: 7.98 (5) | LA: 0.093 (5)<br>PA: 0.195 (2) |
| Generalization of audio deepfake detection<br>Paper | RIR and MUSAN | LFB | ResNet18 | LCML | LA: 1.81 (4) | LA: 0.052 (4) |
| Continual learning for fake audio detection<br>Paper | — | LFCC | LCNN, DFWF | Similarity Loss | LA: 7.74 (15)<br>PA: 8.85 (6) | — |
| Partially-connected differentiable architecture search for deepfake and spoofing detection<br>Paper Code | Frequency Mask | LFCC | PC-DARTS | WCE | LA: 4.96 (12) | LA: 0.091 (8) |
| One-class learning towards synthetic voice spoofing detection<br>Paper Code | — | LFCC | ResNet18 | OC-Softmax | LA: 2.19 (7) | LA: 0.059 (5) |
| Replay and synthetic speech detection with Res2Net architecture<br>Paper Code | — | CQT | SE-Res2Net50 | BCE | LA: 2.50 (8)<br>PA: 0.46 (2) | LA: 0.074 (7)<br>PA: 0.012 (2) |
| An empirical study on channel effects for synthetic voice spoofing countermeasure systems<br>Paper Code | Telephone codecs, and device/room impulse responses (IRs) | LFCC | LCNN, ResNet-OC | OC-Softmax, CE | LA: 3.92 (10) | — |
| Efficient attention branch network with combined loss function for automatic speaker verification spoof detection<br>Paper Code | SpecAug, Attention Mask | LFCC | EfficientNet-A0, SE-Res2Net50 | WCE, Triplet Loss | LA: 1.89 (6)<br>PA: 0.86 (4) | LA: 0.507 (11)<br>PA: 0.024 (4) |
| ResMax: Detecting voice spoofing attacks with residual network and max feature map<br>Paper | — | CQT | ResMax | BCE | LA: 2.19 (7)<br>**PA: 0.37 (1)** | LA: 0.060 (6)<br>**PA: 0.009 (1)** |
| Synthetic voice detection and audio splicing detection using SE-Res2Net-Conformer architecture<br>Paper | Adding noise at an SNR of 15 dB or 25 dB | CQT | SE-Res2Net34-Conformer | CE | LA: 1.85 (5) | LA: 0.060 (6) |
| Fastaudio: A learnable audio front-end for spoof speech detection<br>Paper Code | — | L-VQT | L-DenseNet | NLLLoss | LA: 1.54 (3) | LA: 0.045 (3) |
| Learning from yourself: A self-distillation method for fake speech detection<br>Paper | — | LPS, F0 | ECANet, SENet | A-Softmax | LA: 1.00 (2)<br>PA: 0.65 (3) | LA: 0.031 (2)<br>PA: 0.017 (3) |
| How to boost anti-spoofing with x-vectors<br>Paper | — | LFCC, MFCC | TDNN, SENet34 | LCML | **LA: 0.83 (1)** | **LA: 0.024 (1)** |

Note: "—" indicates not mentioned in the paper. Values in brackets in the experimental results are the ranking of each column in the LA or PA scenario, and bolded values are the best results for that scenario.
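LFCC, the most common front-end in the table above, follows the familiar MFCC pipeline but with a linearly spaced filterbank, which preserves resolution in the high frequencies where many synthesis artifacts appear. A rough sketch; the filter count, FFT size, and hop length are illustrative choices rather than any specific paper's configuration:

```python
import numpy as np
from scipy.fft import dct

def lfcc(wav, sr=16000, n_filters=20, n_ceps=20, n_fft=512, hop=160):
    """Linear-frequency cepstral coefficients, shape (n_frames, n_ceps)."""
    # Frame the signal and take the power spectrum of each windowed frame
    frames = np.lib.stride_tricks.sliding_window_view(wav, n_fft)[::hop]
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
    # Triangular filterbank with LINEARLY spaced centers (mel-spaced in MFCC)
    edges = np.linspace(0, sr / 2, n_filters + 2)
    bins = np.floor(edges / (sr / 2) * (n_fft // 2)).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        if center > left:
            fbank[i, left:center] = np.linspace(0, 1, center - left, endpoint=False)
        if right > center:
            fbank[i, center:right] = np.linspace(1, 0, right - center, endpoint=False)
    # Log filterbank energies, then DCT to decorrelate
    log_energies = np.log(spec @ fbank.T + 1e-10)
    return dct(log_energies, type=2, norm="ortho", axis=1)[:, :n_ceps]

coeffs = lfcc(np.random.randn(16000))   # one second of stand-in audio
```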
## End-to-end Forgery Detection
[**⬆ Back to top**](#table-of-contents)

| Paper | Data Augmentation | Feature Extraction | Network Structure | Loss Function | EER (%) | t-DCF |
|:----|:----:|:----:|:----:|:----:|:----:|:----:|
| A light convolutional GRU-RNN deep feature extractor for ASV spoofing detection<br>Paper | — | LC-GRNN | PLDA | — | LA: 6.28 (13)<br>PA: 2.23 | LA: 0.152 (10)<br>PA: 0.061 |
| RW-ResNet: A novel speech anti-spoofing model using raw waveform<br>Paper | — | 1D Convolution Residual Block | ResNet | CE | LA: 2.98 (11) | LA: 0.082 (9) |
| Raw differentiable architecture search for speech deepfake and spoofing detection<br>Paper Code | Masking Filter | Sinc Filter | PC-DARTS | P2SGrad | LA: 1.77 (10) | LA: 0.052 (7) |
| Towards end-to-end synthetic speech detection<br>Paper Code | — | DNN | Res-TSSDNet, Inc-TSSDNet | WCE | LA: 1.64 (9) | LA: 0.048 (6) |
| End-to-end anti-spoofing with RawNet2<br>Paper Code | — | Sinc Filter | RawNet2 | CE | LA: 1.12 (5) | LA: 0.033 (3) |
| Long-term variable Q transform: A novel time-frequency transform algorithm for synthetic speech detection<br>Paper | — | FastAudio filter | X-vector, ECAPA-TDNN | — | LA: 1.54 (7) | LA: 0.045 (5) |
| Fully automated end-to-end fake audio detection<br>Paper | — | Sinc Filter, Wav2Vec2 | light-DARTS | Comparative loss | LA: 1.08 (4) | — |
| Audio anti-spoofing using a simple attention module and joint optimization based on additive angular margin loss and meta-learning<br>Paper | — | Sinc Filter | RawNet2, SimAM | AAM Softmax, MSE | LA: 0.99 (3) | LA: 0.029 (2) |
| AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks<br>Paper Code | — | Sinc Filter | RawNet2, MGO, HS-GAL | CE | LA: 0.83 (2) | **LA: 0.028 (1)** |
| AI-synthesized voice detection using neural vocoder artifacts<br>Paper Code | Resampling, Noise Addition | Sinc Filter | RawNet2 | CE, Softmax | LA: 4.54 (12) | — |
| To-RawNet: Improving RawNet with TCN and orthogonal regularization for fake audio detection<br>Paper | RawBoost | Sinc Filter | RawNet2, TCN | CE, Orthogonal Loss | LA: 1.58 (8) | — |
| Speaker-Aware Anti-spoofing<br>Paper | — | Sinc Filter | AASIST, M2S Converter | CE | LA: 1.13 (6) | LA: 0.038 (4) |
| Spoofing attacker also benefits from self-supervised pretrained model<br>Paper | — | HuBERT, WavLM | Residual block, Conv-TasNet | AAM softmax | **LA: 0.44 (1)** | — |

Note: "—" indicates not mentioned in the paper. Values in brackets in the experimental results are the ranking of each column in the LA scenario, and bolded values are the best results for that scenario.
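The "Sinc Filter" front-end that dominates this table (SincNet-style, as used in RawNet2 and AASIST) is a first convolution whose kernels are parameterized band-pass filters, so only the cutoff frequencies are learned from the raw waveform. A condensed sketch in that spirit; the initialization and sizes are simplified assumptions, not the published configurations:

```python
import torch

class SincConv(torch.nn.Module):
    """Band-pass filterbank as a conv layer; only the cutoffs are learned."""
    def __init__(self, n_filters=20, kernel_size=129, sr=16000):
        super().__init__()
        # Learnable low cutoffs and bandwidths in Hz (rough linear init)
        self.low_hz = torch.nn.Parameter(torch.linspace(30, sr / 2 - 300, n_filters))
        self.band_hz = torch.nn.Parameter(torch.full((n_filters,), 100.0))
        t = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1) / sr
        self.register_buffer("t", t)
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):                          # x: (batch, 1, samples)
        f1 = torch.abs(self.low_hz)                # lower cutoff per filter
        f2 = f1 + torch.abs(self.band_hz)          # upper cutoff per filter
        def low_pass(f):                           # ideal low-pass kernels
            return 2 * f.unsqueeze(1) * torch.sinc(2 * f.unsqueeze(1) * self.t)
        # Band-pass = difference of two low-passes, smoothed by a window
        kernels = (low_pass(f2) - low_pass(f1)) * self.window
        return torch.nn.functional.conv1d(x, kernels.unsqueeze(1), padding="same")

feats = SincConv()(torch.randn(2, 1, 16000))       # -> (2, 20, 16000)
```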
## Feature Fusion-based Forgery Detection
[**⬆ Back to top**](#table-of-contents)

| Paper | Feature Extraction | Network Structure | Loss Function | EER (%) |
|:----|:----:|:----:|:----:|:----:|
| Voice spoofing countermeasure for synthetic speech detection<br>Paper | GTCC, MFCC, Spectral Flux, Spectral Centroid | Bi-LSTM | — | LA: 3.05 (4) |
| Combining automatic speaker verification and prosody analysis for synthetic speech detection<br>Paper | MFCC, Mel-Spectrogram | ECAPA-TDNN, Prosody Encoder | BCE | LA: 5.39 (5) |
| Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation<br>Paper | Sinc Filter, Wav2Vec2 | AASIST | Contrastive Loss, WCE | — |
| Overlapped frequency-distributed network: Frequency-aware voice spoofing countermeasure<br>Paper | Mel-Spectrogram, CQT | LCNN, ResNet | — | LA: 1.35 (2)<br>PA: 0.35 |
| Detection of cross-dataset fake audio based on prosodic and pronunciation features<br>Paper | Phoneme Feature, Prosody Feature, Wav2Vec2 | LCNN, Bi-LSTM | CTC | LA: 1.58 (3) |
| Betray oneself: A novel audio deepfake detection model via mono-to-stereo conversion<br>Paper Code | Sinc Filter | AASIST, M2S Converter | CE | **LA: 1.34 (1)** |

Note: "—" indicates not mentioned in the paper. Values in brackets in the experimental results are the ranking of each column in the LA scenario, and bolded values are the best results for that scenario.
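A common pattern behind these systems is late fusion: encode each feature stream separately, then concatenate the embeddings before the classifier. A minimal sketch in which both encoders, the feature dimensions, and the GRU choice are placeholders rather than any published design:

```python
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Late fusion of two feature streams, e.g. LFCC and CQT sequences."""
    def __init__(self, dim_a=60, dim_b=84, emb=128):
        super().__init__()
        self.enc_a = nn.GRU(dim_a, emb, batch_first=True)   # stream A encoder
        self.enc_b = nn.GRU(dim_b, emb, batch_first=True)   # stream B encoder
        self.classifier = nn.Linear(2 * emb, 2)             # bona fide vs spoof

    def forward(self, feat_a, feat_b):     # (B, T, dim_a), (B, T, dim_b)
        _, h_a = self.enc_a(feat_a)        # final hidden state per stream
        _, h_b = self.enc_b(feat_b)
        fused = torch.cat([h_a[-1], h_b[-1]], dim=-1)       # concatenate embeddings
        return self.classifier(fused)      # (B, 2) logits

logits = TwoStreamFusion()(torch.randn(4, 300, 60), torch.randn(4, 300, 84))
```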
# Network Training

## Multi-task Learning-based Forgery Detection
[**⬆ Back to top**](#table-of-contents)

| Paper | Feature Extraction | Network Structure | Loss Function | EER (%) | t-DCF |
|:----|:----:|:----:|:----:|:----:|:----:|
| Multi-task learning in utterance-level and segmental-level spoof detection<br>Paper | LFCC | SELCNN, Bi-LSTM | P2SGrad | — | — |
| SA-SASV: An end-to-end spoof-aggregated spoofing-aware speaker verification system<br>Paper Code | Fbanks, Sinc Filter | ECAPA-TDNN, ARawNet | BCE, AAM Softmax, CE | LA: 4.86 (4) | — |
| STATNet: Spectral and temporal features based multi-task network for audio spoofing detection<br>Paper | Sinc Filter | RawNet2, TCM, SCM | CE | LA: 2.45 (3) | LA: 0.062 (2) |
| A probabilistic fusion framework for spoofing aware speaker verification<br>Paper Code | Mel Filter, Sinc Filter | ECAPA-TDNN, AASIST | BCE | LA: 1.53 (2) | — |
| DSVAE: Interpretable disentangled representation for synthetic speech detection<br>Paper | Spectrogram | VAE | KL Divergence Loss, BCE | LA: 6.56 (5) | — |
| End-to-end dual-branch network towards synthetic speech detection<br>Paper Code | LFCC, CQT | Dual-Branch Network | Classification Loss, Fake Type Classification Loss | **LA: 0.80 (1)** | **LA: 0.021 (1)** |

Note: "—" indicates not mentioned in the paper. Values in brackets in the experimental results are the ranking of each column in the LA scenario, and bolded values are the best results for that scenario.

# Reference
[**⬆ Back to top**](#table-of-contents)

If you find our repository useful for your research, please cite it as follows:
```
@article{xu2024,
  title={Research progress on speech deepfake and its detection techniques},
  author={Xu, Yuxiong and Li, Bin and Tan, Shunquan and Huang, Jiwu},
  journal={Journal of Image and Graphics},
  volume={29},
  number={8},
  pages={2236--2268},
  year={2024}
}
```

# Statement
[**⬆ Back to top**](#table-of-contents)

This project aims to build a reference database for audio deepfake detection, solely for communication and learning. All content collected here is sourced from journals and the internet, and we express sincere gratitude to the researchers and authors who have published the related work. In the event of a copyright infringement complaint, the content in question will be removed as appropriate.
684 | Note: "—" indicates not mentioned in the paper. Values in brackets in the experimental results are the ranking of each column in the LA scenario, and bolded values are the best results for that scenario. 685 | 686 | # Reference 687 | [**⬆ Back to top**](#table-of-contents) 688 | 689 | If you find our repository useful for your research, please cite it as follows: 690 | ``` 691 | @article{xu2024, 692 | title={Research progress on speech deepfake and its detection techniques}, 693 | author={Xu Yuxiong, Li Bin, Tan Shunquan, Huang Jiwu}, 694 | journal={Journal of Image and Graphics}, 695 | volume={29}, 696 | number={8}, 697 | pages={2236--2268}, 698 | year={2024} 699 | } 700 | ``` 701 | 702 | # Statement 703 | [**⬆ Back to top**](#table-of-contents) 704 | 705 | The purpose of this project is to establish a database based on audio deepfake detection, solely for the purpose of communication and learning. All the content collected in this project is sourced from journals and the internet, and we express sincere gratitude to the researchers and authors who have published related research achievements. In the event of a complaint of copyright infringement, the content will be removed as appropriate. 706 | 707 | 709 | 710 | 711 | --------------------------------------------------------------------------------