> [!IMPORTANT]
>
> A curated collection of papers and resources on Audio Deepfake Detection (ADD).
>
> Please refer to our survey [**"Research progress on speech deepfake and its detection techniques"**](https://www.cjig.cn/en/article/doi/10.11834/jig.230476/?viewType=HTML) for the detailed contents.
>
> Please let us know if you discover any mistakes or have suggestions by emailing us: xuyuxiong2022@email.szu.edu.cn

# Table of contents

- [Table of contents](#table-of-contents)
- [What's New](#whats-new)
- [Survey](#survey)
- [Top Repositories](#top-repositories)
- [Audio Large Model](#audio-large-model)
- [Datasets](#datasets)
- [Audio Preprocessing](#audio-preprocessing)
  - [Commonly Used Noise Datasets](#commonly-used-noise-datasets)
  - [Audio Enhancement Methods](#audio-enhancement-methods)
- [Feature Extraction](#feature-extraction)
  - [Handcrafted Feature-based Forgery Detection](#handcrafted-feature-based-forgery-detection)
  - [Hybrid Feature-based Forgery Detection](#hybrid-feature-based-forgery-detection)
  - [End-to-end Forgery Detection](#end-to-end-forgery-detection)
  - [Feature Fusion-based Forgery Detection](#feature-fusion-based-forgery-detection)
- [Network Training](#network-training)
  - [Multi-task Learning-based Forgery Detection](#multi-task-learning-based-forgery-detection)
- [Reference](#reference)
- [Statement](#statement)

# What's New
- [Update Jun. 3, 2025] 🔥🔥🔥 Updated [Datasets](#datasets), [Survey](#survey), and [Top Repositories](#top-repositories).
- [Update Feb. 20, 2025] Updated [Datasets](#datasets); added [Survey](#survey) and [Top Repositories](#top-repositories).

# Survey
[**⬆ Back to top**](#table-of-contents)
- `【2022-04】-【ADD-Survey】-【Ben Gurion University】`
  - **A study on data augmentation in voice anti-spoofing**
  - **Author(s):** Ariel Cohen, Inbal Rimon, Eran Aflalo, and Haim H. Permuter
  - [Paper](https://doi.org/10.1016/j.specom.2022.04.005) [Code](https://github.com/InbalRim/A-Study-On-Data-Augmentation-In-Voice-Anti-Spoofing)
- `【2022-05】-【ADD-Survey】-【King Saud University】`
  - **A review of modern audio deepfake detection methods: challenges and future directions**
  - **Author(s):** Zaynab Almutairi and Hebah Elgibreen
  - [Paper](https://doi.org/10.3390/a15050155)
- `【2023-01】-【ADD-Survey】-【University of Maryland Baltimore County】`
  - **Audio deepfakes: A survey**
  - **Author(s):** Zahra Khanjani, Gabrielle Watson, and Vandana P. Janeja
  - [Paper](https://doi.org/10.3389/fdata.2022.1001063)
- `【2023-02】-【ADD-Survey】-【Oakland University】`
  - **Battling voice spoofing: a review, comparative analysis, and generalizability evaluation of state-of-the-art voice spoofing counter measures**
  - **Author(s):** Awais Khan, Khalid Mahmood Malik, James Ryan, and Mikul Saravanan
  - [Paper](https://doi.org/10.1007/s10462-023-10539-8) [Code](https://github.com/smileslab/Comparative-Analysis-Voice-Spoofing)
- `【2023-04】-【ADD-Survey】-【Panjab University】`
  - **Review of audio deepfake detection techniques: Issues and prospects**
  - **Author(s):** Abhishek Dixit, Nirmal Kaur, and Staffy Kingra
  - [Paper](https://doi.org/10.1111/exsy.13322)
- `【2023-07】-【ADD-Survey】-【Indian Institute of Technology】`
  - **Uncovering the deceptions: an analysis on audio spoofing detection and future prospects**
  - **Author(s):** Rishabh Ranjan, Mayank Vatsa, and Richa Singh
  - [Paper](https://doi.org/10.24963/ijcai.2023/756)
- `【2023-08】-【ADD-Survey】-【CASIA】`
  - **Audio deepfake detection: a survey**
  - **Author(s):** Jiangyan Yi, Chenglong Wang, Jianhua Tao, Xiaohui Zhang, Chu Yuan Zhang, and Yan Zhao
  - [Paper](https://arxiv.org/abs/2308.14970)
- `【2024-02】-【Multimodal-Survey】-【Purdue University, Nanchang University】`
  - **Detecting Multimedia Generated by Large AI Models: A Survey**
  - **Author(s):** Li Lin, Neeraj Gupta, Yue Zhang, Hainan Ren, Chun-Hao Liu, Feng Ding, Xin Wang, Xin Li, Luisa Verdoliva, and Shu Hu
  - [Paper](https://arxiv.org/abs/2402.00045) [Code](https://github.com/Purdue-M2/Detect-LAIM-generated-Multimedia-Survey)
- `【2024-04】-【ADD-Survey】-【Toronto Metropolitan University】`
  - **A Survey on Speech Deepfake Detection**
  - **Author(s):** Menglu Li, Yasaman Ahmadiadli, and Xiao-Ping Zhang
  - [Paper](https://doi.org/10.1145/3714458)
- `【2024-05】-【Multimodal-Survey】-【Cool Large Language Models Research Group, Renmin University of China】`
  - **Fake Artificial Intelligence Generated Contents (FAIGC): A Survey of Theories, Detection Methods, and Opportunities**
  - **Author(s):** Xiaomin Yu, Yezhaohui Wang, Yanfang Chen, Zhen Tao, Dinghao Xi, Shichao Song, and Simin Niu
  - [Paper](https://arxiv.org/abs/2405.00711)
- `【2024-09】-【ADD-Survey】-【Austrian Institute of Technology】`
  - **A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection**
  - **Author(s):** Lam Pham, Phat Lam, Tin Nguyen, Hieu Tang, Huyen Nguyen, Alexander Schindler, and Hai Canh Vu
  - [Paper](https://doi.org/10.1016/j.cosrev.2025.100757) [Code](https://github.com/AI-ResearchGroup/A-Comprehensive-Survey-with-Critical-Analysis-for-Deepfake-Speech-Detection)
- `【2024-11】-【Multimodal-Survey】-【University of Bucharest】`
  - **Deepfake Media Generation and Detection in the Generative AI Era: A Survey and Outlook**
  - **Author(s):** Florinel-Alin Croitoru, Andrei-Iulian Hîji, Vlad Hondru, Nicolae-Cătălin Ristea, Paul Irofti, Marius Popescu, Cristian Rusu, Radu Tudor Ionescu, Fahad Shahbaz Khan, and Mubarak Shah
  - [Paper](https://arxiv.org/abs/2411.19537) [Code](https://github.com/CroitoruAlin/biodeep)
- `【2024-11】-【Multimodal-Survey】-【University College Dublin】`
  - **Passive Deepfake Detection Across Multi-modalities: A Comprehensive Survey**
  - **Author(s):** Hong-Hanh Nguyen-Le, Van-Tuan Tran, Dinh-Thuc Nguyen, and Nhien-An Le-Khac
  - [Paper](https://arxiv.org/abs/2411.17911)
- `【2024-12】-【ADD-Survey】-【Imperial College London, Technical University of Munich】`
  - **From Audio Deepfake Detection to AI-Generated Music Detection: A Pathway and Overview**
  - **Author(s):** Yupei Li, Manuel Milling, Lucia Specia, and Björn W. Schuller
  - [Paper](https://arxiv.org/abs/2412.00571)
- `【2025-02】-【Multimodal-Survey】-【BUPT, University of California, CASIA】`
  - **Survey on AI-Generated Media Detection: From Non-MLLM to MLLM**
  - **Author(s):** Yueying Zou, Peipei Li, Zekun Li, Huaibo Huang, Xing Cui, Xuannan Liu, Chenghanyu Zhang, and Ran He
  - [Paper](https://arxiv.org/abs/2502.05240)

96 |
97 | # Top Repositories
98 | [**⬆ Back to top**](#table-of-contents)
99 | - **Audio Large Language Models**
100 | - Resources on Audio Large Language Models, including datasets, methods, benchmarks, and studies.
101 | - [Repo](https://github.com/AudioLLMs/Awesome-Audio-LLM)
102 | - **awesome-fake-audio-detection**
103 | - A list of tools, papers and code related to Fake Audio Detection.
104 | - [Repo](https://github.com/john852517791/awesome-fake-audio-detection?tab=readme-ov-file)
105 | - **ASVspoof Challenge Official Repository**
106 | - Sub-repositories include [ASVspoof 5](https://github.com/asvspoof-challenge/asvspoof5), [ASVspoof 2021](https://github.com/asvspoof-challenge/2021), and [classifier-adjacency](https://github.com/asvspoof-challenge/classifier-adjacency).
107 | - [Repo](https://github.com/asvspoof-challenge)
108 |
# Audio Large Model
[**⬆ Back to top**](#table-of-contents)
| Model | Publisher | Year | Achievable Tasks |
|:----:|:-------------:|:--------------:|:--------------------|
| AudioLM<br>[Paper](https://arxiv.org/pdf/2209.03143) [Website](https://google-research.github.io/seanet/audiolm/examples/) [Code](https://github.com/lucidrains/audiolm-pytorch) | Google | 2022.09 | 1. Maps input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space.<br>2. Speech continuation, acoustic generation, unconditional generation, generation without semantic tokens, and piano continuation. |
| VALL-E<br>[Paper](https://arxiv.org/abs/2301.02111) [Website](https://www.microsoft.com/en-us/research/project/vall-e-x/) | Microsoft | 2023.01 | 1. Synthesizes high-quality personalized speech from only a 3-second recording of an unseen speaker.<br>2. VALL-E X: cross-lingual speech synthesis. |
| USM<br>[Website](https://sites.research.google/usm/) | Google | 2023.03 | 1. ASR for over 100 languages.<br>2. Downstream ASR tasks.<br>3. Automatic Speech Translation (AST). |
| SpeechGPT<br>[Website](https://github.com/0nutation/SpeechGPT) | Fudan University | 2023.05 | 1. Perceives and generates multi-modal content.<br>2. A spoken-dialogue LLM that follows multi-modal human instructions. |
| Pengi<br>[Paper](https://arxiv.org/pdf/2305.11834.pdf) [Website](https://github.com/microsoft/Pengi) | Microsoft | 2023.05 | 1. An audio language model that leverages transfer learning by framing all audio tasks as text-generation tasks.<br>2. Its unified architecture enables both open-ended and close-ended tasks without additional fine-tuning or task-specific extensions. |
| VoiceBox<br>[Website](https://voicebox.metademolab.com/) | Meta | 2023.06 | 1. Synthesizes speech across six languages.<br>2. Removes transient noise.<br>3. Edits content.<br>4. Transfers audio style within and across languages.<br>5. Generates diverse speech samples. |
| AudioPaLM<br>[Paper](https://arxiv.org/pdf/2306.12925) [Website](https://google-research.github.io/seanet/audiopalm/examples) | Google | 2023.06 | 1. Speech-to-speech translation.<br>2. Automatic Speech Recognition (ASR). |

# Datasets
[**⬆ Back to top**](#table-of-contents)
| Attack Types | Year | Dataset | Number of Audio<br>(Subdataset: Real/Fake) | Language |
|:-----------:|:------------:|:------------:|:-------------:|:------------:|
|TTS|2019|FoR<br>[Paper](https://ieeexplore.ieee.org/document/8906599) [Dataset](http://bil.eecs.yorku.ca/datasets)|111,000/87,000|English|
|TTS|2021|WaveFake<br>[Paper](https://arxiv.org/abs/2111.02813) [Dataset](https://zenodo.org/record/5642694)|16,283/117,985|English, Japanese|
|TTS|2021|HAD (Half-truth)<br>[Paper](https://arxiv.org/abs/2104.03617)|53,612/107,224|Chinese|
|TTS|2021|FakeAVCeleb<br>[Paper](https://arxiv.org/pdf/2108.05080) [Dataset](https://github.com/DASH-Lab/FakeAVCeleb/tree/main)|10,209/11,335|English|
|TTS|2022|ADD 2022<br>[Paper](https://arxiv.org/pdf/2202.08433v2.pdf)|LF: 5,619/46,067<br>PF: 5,319/46,419<br>FG-D: 5,319/46,206|Chinese|
|TTS|2022|CMFD<br>[Paper](https://ieeexplore.ieee.org/document/9921343) [Dataset](https://github.com/WuQinfang/CMFD)|Chinese: 1,800/1,000<br>English: 1,800/1,000|English, Chinese|
|TTS|2022|In-the-Wild<br>[Paper](https://arxiv.org/abs/2203.16263) [Dataset](https://deepfake-demo.aisec.fraunhofer.de/in_the_wild)|19,963/11,816|English|
|TTS|2022|CFAD<br>[Paper](https://arxiv.org/abs/2207.12308) [Dataset](https://zenodo.org/record/8122764)|38,600/77,200|Chinese|
|TTS|2022|Psynd<br>[Paper](https://ieeexplore.ieee.org/document/9956134) [Dataset](https://scholarbank.nus.edu.sg/handle/10635/227398)|2,294|English|
|TTS|2022|TIMIT-TTS<br>[Paper](https://arxiv.org/abs/2209.08000) [Dataset](https://zenodo.org/records/6560159)|0/79,120|English|
|TTS|2023|ODSS<br>[Paper](https://ieeexplore.ieee.org/abstract/document/10374863) [Dataset](https://zenodo.org/records/8370669)|11,032/18,993|English, German, Spanish|
|TTS|2024|MLAAD<br>[Paper](https://arxiv.org/abs/2401.09512) [Dataset](https://deepfake-total.com/mlaad)|-/76,000|Multi-lingual|
|TTS|2024|CD-ADD<br>[Paper](https://arxiv.org/abs/2404.04904)|300 hours|English|
|TTS|2024|DiffSSD<br>[Paper](https://arxiv.org/abs/2409.13049) [Dataset](https://huggingface.co/datasets/purdueviperlab/diffssd)|24,226/70,000|English|
|TTS|2024|ACCENT<br>[Paper](https://arxiv.org/abs/2409.08346)|53,651/192,461|Multi-lingual|
|TTS|2024|SpoofCeleb<br>[Paper](https://arxiv.org/abs/2409.17285) [Dataset](http://www.jungjee.com/spoofceleb/)|250k+|English|
|TTS|2024|LlamaPartialSpoof<br>[Paper](https://arxiv.org/abs/2409.14743) [Dataset](https://github.com/hieuthi/LlamaPartialSpoof)|10,573/33,479|English|
|TTS|2024|FakeSound<br>[Paper](https://arxiv.org/abs/2406.08052) [Dataset](https://github.com/FakeSoundData/FakeSound)|-/3,798|English|
|TTS|2024|DFADD<br>[Paper](https://arxiv.org/abs/2409.08731) [Dataset](https://github.com/isjwdu/DFADD)|44,455/163,500|English|
|TTS|2024|SONAR<br>[Paper](https://arxiv.org/abs/2410.04324) [Dataset](https://github.com/Jessegator/SONAR)|-/2,274|Multi-lingual|
|TTS|2025|6KSFx<br>[Paper](https://arxiv.org/abs/2501.17198) [Dataset](https://github.com/nellyngz95/6KSFX)|-/6,000|-|
|TTS|2025|BanglaFake<br>[Paper](https://arxiv.org/html/2505.10885) [Dataset](https://huggingface.co/datasets/sifat1221/banglaFake)|12,260/13,260|Bengali|
|Replay|2017|ASVspoof 2017<br>[Paper](https://www.asvspoof.org/asvspoof2017overview_cameraReady.pdf) [Dataset](https://datashare.ed.ac.uk/handle/10283/3055)|3,565/14,465|English|
|Replay|2019|ReMASC<br>[Paper](https://arxiv.org/abs/1904.03365) [Dataset](https://github.com/YuanGongND/ReMASC)|9,240/45,472|English, Chinese, Hindi|
|Replay|2019|VSDC<br>[Paper](https://arxiv.org/abs/1909.00935) [Dataset](http://www.secs.oakland.edu/~mahmood/datasets/audiospoof)|1,687/11,772|English|
|Replay|2024|POLIPHONE<br>[Paper](https://arxiv.org/abs/2410.06221) [Dataset](https://zenodo.org/records/13903412)|41,941|English|
|TTS and VC|2015|AVspoof<br>[Paper](https://ieeexplore.ieee.org/document/7358783) [Dataset](https://zenodo.org/record/4081040)|LA: 15,504/120,480<br>PA: 15,504/14,465|English|
|TTS and VC|2015|ASVspoof 2015<br>[Paper](https://datashare.ed.ac.uk/bitstream/handle/10283/853/ASVspoof2015_evaluation_plan.pdf?sequence=2&isAllowed=y) [Dataset](http://dx.doi.org/10.7488/ds/298)|16,651/246,500|English|
|TTS and VC|2021|FMFCC-A<br>[Paper](https://arxiv.org/abs/2110.09441) [Dataset](https://github.com/Amforever/FMFCC-A)|10,000/40,000|Chinese|
|TTS and VC|2022|SceneFake<br>[Paper](https://arxiv.org/abs/2211.06073) [Dataset](https://zenodo.org/record/7663324#.Y_XKMuPYuUk)|19,838/64,642|English|
|TTS and VC|2022|EmoFake<br>[Paper](http://arxiv.org/abs/2211.05363)|35,000/53,200|English, Chinese|
|TTS and VC|2023|PartialSpoof<br>[Paper](https://arxiv.org/abs/2104.02518) [Dataset](https://zenodo.org/record/4817532#.YLd8Yi2l1hF)|12,483/108,978|English|
|TTS and VC|2023|ADD 2023<br>[Paper](https://arxiv.org/abs/2305.13774)|FG-D: 172,819/113,042<br>RL: 55,468/65,449<br>AR: 14,907/95,383|Chinese|
|TTS and VC|2023|DECRO<br>[Paper](https://dl.acm.org/doi/10.1145/3543507.3583222) [Dataset](https://github.com/petrichorwq/DECRO-dataset)|Chinese: 21,218/41,880<br>English: 12,484/42,799|English, Chinese|
|TTS and VC|2023|HABLA<br>[Paper](https://doi.org/10.21437/Interspeech.2023-2272) [Dataset](https://github.com/Ruframapi/HABLA)|22,000/58,000|Spanish|
|TTS and VC|2024|DeepFakeVox-HQ<br>[Paper](https://ieeexplore.ieee.org/document/10800535)|693k/643k|English|
|TTS and VC|2024|VoiceWukong<br>[Paper](https://arxiv.org/abs/2409.06348) [Dataset](https://voicewukong.github.io)|5,300/413,400|English, Chinese|
|TTS and VC|2024|Speech-Forensics<br>[Paper](https://www.ijcai.org/proceedings/2024/0046) [Dataset](https://github.com/ring-zl/Speech-Forensics)|13,100/7,452|English|
|TTS and VC|2024|VoiceEdit<br>[Paper](https://arxiv.org/abs/2402.06304)|-|Multi-lingual|
|TTS and VC|2024|RFP<br>[Paper](https://arxiv.org/abs/2404.17721) [Dataset](https://zenodo.org/records/10202142)|28,115/74,199|English|
|TTS and VC|2025|MADD<br>[Paper](https://ieeexplore.ieee.org/document/10800535)|60,000/129,990|Multi-lingual|
|TTS and VC|2025|XMAD-Bench<br>[Paper](https://arxiv.org/pdf/2506.00462) [Dataset](https://github.com/ristea/xmad-bench/)|414,858|Multi-lingual|
|TTS and Vocoder|2024|Diffuse or Confuse<br>[Paper](https://arxiv.org/abs/2410.06796) [Dataset](https://github.com/AntonFirc/diffusion-deepfake-speech-dataset/)|131,000/183,400|English|
|TTS and Vocoder|2025|ShiftySpeech<br>[Paper](https://arxiv.org/abs/2502.05674) [Dataset](https://huggingface.co/datasets/ash56/ShiftySpeech)|3,000+ hours|English, Chinese, Japanese|
|TTS, VC and Replay|2019|ASVspoof 2019<br>[Paper](https://arxiv.org/abs/1904.05441) [Dataset](https://datashare.ed.ac.uk/handle/10283/3336)|LA: 12,483/108,978<br>PA: 28,890/189,540|English|
|TTS, VC and Replay|2021|ASVspoof 2021<br>[Paper](https://arxiv.org/abs/2109.00535) [Dataset](https://www.asvspoof.org/index2021.html)|LA: 18,452/163,114<br>PA: 126,630/816,480<br>PF: 14,869/519,059|English|
|TTS, VC and Vocoder|2024|SpeechFake<br>[Paper](https://openreview.net/forum?id=GpUO6qYNQG)|3,000+ hours|English|
|VC, Replay and Adversarial|2024|VSASV<br>[Paper](https://www.isca-archive.org/interspeech_2024/hoang24b_interspeech.html) [Dataset](https://github.com/hustep-lab/VSASV-Dataset)|164,000/174,000|Multi-lingual|
|TTS, VC and Adversarial|2024|ASVspoof 5<br>[Paper](https://arxiv.org/abs/2502.08857) [Dataset](https://zenodo.org/records/14498691)|188,819/815,262|English|
|Voice Cloning|2021|RTVCSpoof<br>[Paper](https://ojs.aaai.org/index.php/AAAI/article/view/6044/5900)|3,284/4,843|English|
|Voice Cloning|2024|Kratika Dataset<br>[Paper](https://doi.org/10.1145/3658664.3659658)|24,226/25,000|English|
|Vocoder|2022|Yan Dataset<br>[Paper](https://doi.org/10.1145/3552466.3556525)|8,200/63,200|Chinese|
|Vocoder|2023|LibriSeVoc<br>[Paper](https://arxiv.org/abs/2304.13085) [Dataset](https://github.com/csun22/SyntheticVoice-Detection-Vocoder-Artifacts)|13,201/79,206|English|
|Vocoder|2023|Voc.v1-v4<br>[Paper](https://arxiv.org/abs/2210.10570) [Dataset](https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/09-asvspoof-vocoded-trn)|2,580/10,320|English|
|Vocoder|2024|MLADDC<br>[Paper](https://openreview.net/forum?id=ic3HvoOTeU) [Dataset](https://speech007.github.io/MLADDC_Nips/)|80k/160k|Multi-lingual|
|Vocoder|2024|CVoiceFake<br>[Paper](https://dl.acm.org/doi/10.1145/3658644.3670285) [Dataset](https://safeearweb.github.io/Project/)|23,544/91,700|Multi-lingual|
|Impersonation|2024|IPAD<br>[Paper](https://arxiv.org/abs/2408.17009)|5,170/18,874|Chinese|
|Text-To-Music (TTM)|2024|FSD<br>[Paper](https://arxiv.org/abs/2309.02232) [Dataset](https://github.com/xieyuankun/FSD-Dataset)|200/500|Chinese|
|Text-To-Music (TTM)|2024|SingFake<br>[Paper](https://arxiv.org/abs/2309.07525) [Dataset](https://github.com/yongyizang/SingFake)|634/671|Multi-lingual|
|Text-To-Music (TTM)|2024|CtrSVDD<br>[Paper](https://arxiv.org/abs/2406.02438) [Dataset](https://github.com/SVDDChallenge/)|32,312/188,486|Multi-lingual|
|Text-To-Music (TTM)|2024|FakeMusicCaps<br>[Paper](https://arxiv.org/abs/2409.10684) [Dataset](https://zenodo.org/records/13732524)|5.5k/27,605|English|
|Text-To-Music (TTM)|2025|SONICS<br>[Paper](https://arxiv.org/abs/2408.14080)|48,090/49,074|English|
|Text-To-Music (TTM)|2025|SingNet<br>[Paper](https://arxiv.org/pdf/2505.09325) [Dataset](https://singnet-dataset.github.io/)|2,963.4 hours|Multi-lingual|
|Codec-based Speech Generation (CoSG)|2024|CodecFake<br>[Paper](https://arxiv.org/abs/2406.07237) [Dataset](https://codecfake.github.io/)|42,752/45,045|English|
|Codec-based Speech Generation (CoSG)|2024|ALM-ADD<br>[Paper](https://arxiv.org/abs/2408.10853) [Dataset](https://github.com/xieyuankun/ALM-ADD)|123/210|English|
|Codec-based Speech Generation (CoSG)|2024|Codecfake<br>[Paper](https://ieeexplore.ieee.org/abstract/document/10830534) [Dataset](https://zenodo.org/records/13838106)|132,277/925,939|English, Chinese|
|Codec-based Speech Generation (CoSG)|2025|ST-Codecfake<br>[Paper](https://arxiv.org/pdf/2501.06514) [Dataset](https://zenodo.org/records/14631091)|13,228/145,778|English, Chinese|
|Codec-based Speech Generation (CoSG)|2025|CodecFake+<br>[Paper](https://arxiv.org/abs/2501.08238)|90,163/1,423,894|English|

# Audio Preprocessing

## Commonly Used Noise Datasets
[**⬆ Back to top**](#table-of-contents)
| Dataset | Description |
|:----:|:-------------|
| MUSAN<br>[Dataset](https://www.openslr.org/17/) | A corpus of music, speech, and noise. |
| RIR<br>[Dataset](https://www.openslr.org/28/) | A database of simulated and real room impulse responses, plus isotropic and point-source noises. All audio files are sampled at 16 kHz with 16-bit precision. |
| NOIZEUS<br>[Dataset](https://ecs.utdallas.edu/loizou/speech/noizeus/) | Contains 30 IEEE sentences (produced by three male and three female speakers) corrupted by eight real-world noises at different SNRs: suburban train, babble, car, exhibition hall, restaurant, street, airport, and train station. |
| NoiseX-92<br>[Dataset](http://spib.linse.ufsc.br/noise.html) | Fifteen noise types, each 235 seconds long, recorded at a 19.98 kHz sampling rate through a 16-bit A/D converter with an anti-aliasing filter and no pre-emphasis stage. |
| DEMAND<br>[Dataset](https://zenodo.org/record/1227121#.YXtsyPlBxjU) | A multi-channel acoustic noise database covering diverse environments. |
| ESC-50<br>[Dataset](https://github.com/karolpiczak/ESC-50) | A labeled collection of 2,000 environmental recordings from Freesound.org clips, suitable for environmental sound classification: 5-second recordings organized into 5 broad categories, each with 10 classes (40 examples per class). |
| ESC<br>[Dataset](http://shujujishi.com/dataset/69b2bf03-d855-4f8b-ab96-1ec80e285863.html) | Includes ESC-50, ESC-10, and ESC-US. |
| FSD50K<br>[Dataset](https://zenodo.org/record/4060432#.Y1kvcHZByUk) | An open dataset of human-labeled sound events: 51,197 Freesound clips totalling 108.3 hours of multi-label audio, unequally distributed across 200 classes from the AudioSet Ontology. |

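In ADD pipelines these corpora are typically mixed into bona fide speech at a chosen signal-to-noise ratio. A minimal NumPy sketch of SNR-controlled mixing; the function name and the dummy signals are illustrative, not tied to any specific dataset loader:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `speech` at a target signal-to-noise ratio (dB)."""
    # Loop the noise if it is shorter than the speech, then trim to length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]

    # Choose a gain so that 10*log10(P_speech / P_scaled_noise) == snr_db.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

# Corrupt a dummy 1-second, 16 kHz utterance at 15 dB SNR.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(8000)
noisy = mix_at_snr(speech, noise, snr_db=15.0)
```
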
## Audio Enhancement Methods
[**⬆ Back to top**](#table-of-contents)
| Method | Description |
|:----:|:-------------|
| SpecAugment<br>[Paper](https://arxiv.org/pdf/1904.08779.pdf) [Code](https://github.com/DemisEom/SpecAugment) | Enhancement strategies include time warping, frequency masking, and time masking. |
| WavAugment<br>[Paper](https://arxiv.org/abs/2007.00991) [Code](https://github.com/facebookresearch/WavAugment) | Enhancement strategies include pitch randomization, reverberation, additive noise, time dropout (temporal masking), band reject, and clipping. |
| RawBoost<br>[Paper](https://arxiv.org/abs/2111.04433) [Code](https://github.com/TakHemlata/RawBoost-antispoofing) | Enhancement strategies include linear and non-linear convolutive noise, impulsive signal-dependent additive noise, and stationary signal-independent additive noise. |

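For a concrete reference, a minimal NumPy sketch of the frequency- and time-masking steps used by SpecAugment-style augmentation; the mask widths are illustrative, and the original method additionally applies time warping:

```python
import numpy as np

def spec_augment(spec, freq_mask=8, time_mask=20, rng=None):
    """Apply one frequency mask and one time mask to a (freq, time) spectrogram."""
    rng = rng if rng is not None else np.random.default_rng()
    spec = spec.copy()
    n_freq, n_time = spec.shape

    # Frequency masking: zero a random band of consecutive frequency bins.
    f = int(rng.integers(0, freq_mask + 1))
    f0 = int(rng.integers(0, n_freq - f + 1))
    spec[f0:f0 + f, :] = 0.0

    # Time masking: zero a random span of consecutive frames.
    t = int(rng.integers(0, time_mask + 1))
    t0 = int(rng.integers(0, n_time - t + 1))
    spec[:, t0:t0 + t] = 0.0
    return spec

# Example on a dummy 80-mel, 200-frame log-spectrogram.
masked = spec_augment(np.random.default_rng(0).standard_normal((80, 200)))
```
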
# Feature Extraction

## Handcrafted Feature-based Forgery Detection
[**⬆ Back to top**](#table-of-contents)

| Paper | Data Augmentation | Feature Extraction | Network Framework | Loss Function | EER (%) | t-DCF |
|:----|:----:|:----:|:----:|:----:|:----:|:----:|
| Detecting spoofing attacks using VGG and SincNet: BUT-Omilia submission to ASVspoof 2019 challenge (Paper, Code) | — | CQT, Power Spectrum | VGG, SincNet | CE | LA: 8.01 (4)<br>PA: 1.51 (2) | LA: 0.208 (4)<br>PA: 0.037 (1) |
| Long-term high frequency features for synthetic speech detection (Paper) | Cafe, White and Street Noise | ICQC, ICQCC, ICBC, ICLBC | DNN | CE | LA: 7.78 (3) | LA: 0.187 (3) |
| Voice spoofing countermeasure for logical access attacks detection (Paper) | — | ELTP-LFCC | DBiLSTM | — | LA: 0.74 (1) | LA: 0.008 (1) |
| Voice spoofing detector: A unified anti-spoofing framework (Paper) | — | ATP-GTCC | SVM | Hamming Distance | LA: 0.75 (2)<br>PA: 1.00 (1) | LA: 0.050 (2)<br>PA: 0.064 (2) |
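
Most front-ends in the table above are fixed time-frequency transforms. For illustration, a minimal librosa sketch computing a log-power CQT, one of the features these systems use; the hop length and bin counts are illustrative defaults, not any paper's exact configuration:

```python
import librosa
import numpy as np

# Load a bundled example clip and resample to 16 kHz (any wav file works here).
y, sr = librosa.load(librosa.ex("trumpet"), sr=16000)

# Constant-Q transform: geometrically spaced frequency bins give finer
# low-frequency resolution than an ordinary STFT.
C = librosa.cqt(y, sr=sr, hop_length=256, n_bins=84, bins_per_octave=12)

# Log-power CQT, the usual input representation for CQT-based detectors.
log_cqt = librosa.amplitude_to_db(np.abs(C), ref=np.max)  # shape: (84, n_frames)
```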

## Hybrid Feature-based Forgery Detection
[**⬆ Back to top**](#table-of-contents)

| Paper | Data Augmentation | Feature Extraction | Network Structure | Loss Function | EER (%) | t-DCF |
|:----|:----:|:----:|:----:|:----:|:----:|:----:|
| Light convolutional neural network with feature genuinization for detection of synthetic speech attacks (Paper) | — | CQT-based LPS | LCNN | — | LA: 4.07 (11) | LA: 0.102 (10) |
| Siamese convolutional neural network using gaussian probability feature for spoofing speech detection (Paper) | — | LFCC | Siamese CNN | CE | LA: 3.79 (10)<br>PA: 7.98 (5) | LA: 0.093 (5)<br>PA: 0.195 (2) |
| Generalization of audio deepfake detection (Paper) | RIR and MUSAN | LFB | ResNet18 | LCML | LA: 1.81 (4) | LA: 0.052 (4) |
| Continual learning for fake audio detection (Paper) | — | LFCC | LCNN, DFWF | Similarity Loss | LA: 7.74 (15)<br>PA: 8.85 (6) | — |
| Partially-connected differentiable architecture search for deepfake and spoofing detection (Paper, Code) | Frequency Mask | LFCC | PC-DARTS | WCE | LA: 4.96 (12) | LA: 0.091 (8) |
| One-class learning towards synthetic voice spoofing detection (Paper, Code) | — | LFCC | ResNet18 | OC-Softmax | LA: 2.19 (7) | LA: 0.059 (5) |
| Replay and synthetic speech detection with res2net architecture (Paper, Code) | — | CQT | SE-Res2Net50 | BCE | LA: 2.50 (8)<br>PA: 0.46 (2) | LA: 0.074 (7)<br>PA: 0.012 (2) |
| An empirical study on channel effects for synthetic voice spoofing countermeasure systems (Paper, Code) | Telephone Codecs, Device/Room Impulse Responses (IRs) | LFCC | LCNN, ResNet-OC | OC-Softmax, CE | LA: 3.92 (10) | — |
| Efficient attention branch network with combined loss function for automatic speaker verification spoof detection (Paper, Code) | SpecAug, Attention Mask | LFCC | EfficientNet-A0, SE-Res2Net50 | WCE, Triplet Loss | LA: 1.89 (6)<br>PA: 0.86 (4) | LA: 0.507 (11)<br>PA: 0.024 (4) |
| ResMax: Detecting voice spoofing attacks with residual network and max feature map (Paper) | — | CQT | ResMax | BCE | LA: 2.19 (7)<br>PA: 0.37 (1) | LA: 0.060 (6)<br>PA: 0.009 (1) |
| Synthetic voice detection and audio splicing detection using SE-Res2Net-Conformer architecture (Paper) | Additive noise at an SNR of 15 dB or 25 dB | CQT | SE-Res2Net34-Conformer | CE | LA: 1.85 (5) | LA: 0.060 (6) |
| Long-term variable Q transform: A novel time-frequency transform algorithm for synthetic speech detection (Paper) | — | L-VQT | L-DenseNet | NLLLoss | LA: 1.54 (3) | LA: 0.045 (3) |
| Learning from yourself: A self-distillation method for fake speech detection (Paper) | — | LPS, F0 | ECANet, SENet | A-Softmax | LA: 1.00 (2)<br>PA: 0.65 (3) | LA: 0.031 (2)<br>PA: 0.017 (3) |
| How to boost anti-spoofing with x-vectors (Paper) | — | LFCC, MFCC | TDNN, SENet34 | LCML | LA: 0.83 (1) | LA: 0.024 (1) |
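
A recurring ingredient above is one-class training (OC-Softmax in the one-class learning and channel-effects rows): bona fide speech is compacted around a target direction while spoofed speech is pushed away. A minimal PyTorch sketch following the published formulation as we understand it; the embedding size, margins, and scale are commonly cited defaults, not a reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OCSoftmax(nn.Module):
    """One-class softmax: pull bona fide embeddings toward a learned target
    direction and push spoofed embeddings away from it."""

    def __init__(self, feat_dim=256, m_real=0.9, m_fake=0.2, alpha=20.0):
        super().__init__()
        self.center = nn.Parameter(torch.randn(1, feat_dim))
        self.m_real, self.m_fake, self.alpha = m_real, m_fake, alpha

    def forward(self, emb, labels):
        # Cosine similarity between each embedding and the target direction.
        scores = (F.normalize(emb, dim=1) @ F.normalize(self.center, dim=1).t()).squeeze(1)
        # labels: 0 = bona fide (want score > m_real), 1 = spoof (want score < m_fake).
        margins = torch.where(labels == 0, self.m_real - scores, scores - self.m_fake)
        # softplus(alpha * margin) == log(1 + exp(alpha * margin)).
        return F.softplus(self.alpha * margins).mean()

loss = OCSoftmax()(torch.randn(4, 256), torch.tensor([0, 1, 0, 1]))
```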

## End-to-end Forgery Detection
[**⬆ Back to top**](#table-of-contents)

| Paper | Data Augmentation | Feature Extraction | Network Structure | Loss Function | EER (%) | t-DCF |
|:----|:----:|:----:|:----:|:----:|:----:|:----:|
| A light convolutional GRU-RNN deep feature extractor for ASV spoofing detection (Paper) | — | LC-GRNN | PLDA | — | LA: 6.28 (13)<br>PA: 2.23 | LA: 0.152 (10)<br>PA: 0.061 |
| RW-ResNet: A novel speech anti-spoofing model using raw waveform (Paper) | — | 1D Convolution Residual Block | ResNet | CE | LA: 2.98 (11) | LA: 0.082 (9) |
| Raw differentiable architecture search for speech deepfake and spoofing detection (Paper, Code) | Masking Filter | Sinc Filter | PC-DARTS | P2SGrad | LA: 1.77 (10) | LA: 0.052 (7) |
| Towards end-to-end synthetic speech detection (Paper, Code) | — | DNN | Res-TSSDNet, Inc-TSSDNet | WCE | LA: 1.64 (9) | LA: 0.048 (6) |
| End-to-end anti-spoofing with RawNet2 (Paper, Code) | — | Sinc Filter | RawNet2 | CE | LA: 1.12 (5) | LA: 0.033 (3) |
| FastAudio: A learnable audio front-end for spoof speech detection (Paper, Code) | — | FastAudio filter | X-vector, ECAPA-TDNN | — | LA: 1.54 (7) | LA: 0.045 (5) |
| Fully automated end-to-end fake audio detection (Paper) | Sinc Filter | Wav2Vec2 | light-DARTS | Comparative loss | LA: 1.08 (4) | — |
| Audio anti-spoofing using a simple attention module and joint optimization based on additive angular margin loss and meta-learning (Paper) | — | Sinc Filter | RawNet2, SimAM | AAM Softmax, MSE | LA: 0.99 (3) | LA: 0.029 (2) |
| AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks (Paper, Code) | — | Sinc Filter | RawNet2, MGO, HS-GAL | CE | LA: 0.83 (2) | LA: 0.028 (1) |
| AI-synthesized voice detection using neural vocoder artifacts (Paper, Code) | Resampling, Noise Addition | Sinc Filter | RawNet2 | CE, Softmax | LA: 4.54 (12) | — |
| To-RawNet: Improving RawNet with TCN and orthogonal regularization for fake audio detection (Paper) | RawBoost | Sinc Filter | RawNet2, TCN | CE, Orthogonal Loss | LA: 1.58 (8) | — |
| Speaker-Aware Anti-spoofing (Paper) | — | Sinc Filter | AASIST, M2S Converter | CE | LA: 1.13 (6) | LA: 0.038 (4) |
| Spoofing attacker also benefits from self-supervised pretrained model (Paper) | — | HuBERT, WavLM | Residual block, Conv-TasNet | AAM softmax | LA: 0.44 (1) | — |
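
Many of the end-to-end systems above (RawNet2, AASIST, Raw PC-DARTS) start from a bank of band-pass sinc filters applied directly to the raw waveform. In SincNet-style front-ends the cutoff frequencies are learned parameters; the minimal NumPy sketch below fixes them for illustration, with band edges and kernel size chosen arbitrarily:

```python
import numpy as np

def sinc_bandpass(f_lo, f_hi, kernel_size, sr):
    """Windowed band-pass FIR kernel: difference of two ideal low-pass sincs."""
    t = np.arange(kernel_size) - (kernel_size - 1) / 2
    lowpass = lambda fc: 2 * fc / sr * np.sinc(2 * fc * t / sr)
    return (lowpass(f_hi) - lowpass(f_lo)) * np.hamming(kernel_size)

# A tiny fixed filter bank convolved with raw audio (SincNet learns the cutoffs).
sr, ksize = 16000, 251
bands = [(50, 300), (300, 800), (800, 2000), (2000, 5000)]  # Hz, illustrative
wav = np.random.default_rng(0).standard_normal(sr)          # 1 s of dummy audio
feats = np.stack([np.convolve(wav, sinc_bandpass(lo, hi, ksize, sr), mode="same")
                  for lo, hi in bands])                      # (n_filters, n_samples)
```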

## Feature Fusion-based Forgery Detection
[**⬆ Back to top**](#table-of-contents)

| Paper | Feature Extraction | Network Structure | Loss Function | EER (%) |
|:----|:----:|:----:|:----:|:----:|
| Voice spoofing countermeasure for synthetic speech detection (Paper) | GTCC, MFCC, Spectral Flux, Spectral Centroid | Bi-LSTM | — | LA: 3.05 (4) |
| Combining automatic speaker verification and prosody analysis for synthetic speech detection (Paper) | MFCC, Mel-Spectrogram | ECAPA-TDNN, Prosody Encoder | BCE | LA: 5.39 (5) |
| Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation (Paper) | Sinc Filter, Wav2Vec2 | AASIST | Contrastive Loss, WCE | — |
| Overlapped frequency-distributed network: Frequency-aware voice spoofing countermeasure (Paper) | Mel-Spectrogram, CQT | LCNN, ResNet | — | LA: 1.35 (2)<br>PA: 0.35 |
| Detection of cross-dataset fake audio based on prosodic and pronunciation features (Paper) | Phoneme Feature, Prosody Feature, Wav2Vec2 | LCNN, Bi-LSTM | CTC | LA: 1.58 (3) |
| Betray oneself: A novel audio deepfake detection model via mono-to-stereo conversion (Paper, Code) | Sinc Filter | AASIST, M2S Converter | CE | LA: 1.34 (1) |
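
In their simplest form, fusion systems concatenate embeddings from complementary front-ends before a shared classifier. A toy PyTorch sketch of that pattern; all dimensions and branch names are illustrative:

```python
import torch
import torch.nn as nn

# Toy pooled utterance-level embeddings from two different front-end branches.
mfcc_emb = torch.randn(8, 40)     # e.g., output of an MFCC branch
prosody_emb = torch.randn(8, 16)  # e.g., output of a prosody branch

# Simplest fusion: concatenate, then classify bona fide vs. spoof.
fused = torch.cat([mfcc_emb, prosody_emb], dim=1)  # (8, 56)
logits = nn.Linear(56, 2)(fused)
```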

# Network Training

## Multi-task Learning-based Forgery Detection
[**⬆ Back to top**](#table-of-contents)

| Paper | Feature Extraction | Network Structure | Loss Function | EER (%) | t-DCF |
|:----|:----:|:----:|:----:|:----:|:----:|
| Multi-task learning in utterance-level and segmental-level spoof detection (Paper) | LFCC | SELCNN, Bi-LSTM | P2SGrad | — | — |
| SA-SASV: An end-to-end spoof-aggregated spoofing-aware speaker verification system (Paper, Code) | Fbanks, Sinc Filter | ECAPA-TDNN, ARawNet | BCE, AAM Softmax, CE | LA: 4.86 (4) | — |
| STATNet: Spectral and temporal features based multi-task network for audio spoofing detection (Paper) | Sinc Filter | RawNet2, TCM, SCM | CE | LA: 2.45 (3) | LA: 0.062 (2) |
| A probabilistic fusion framework for spoofing aware speaker verification (Paper, Code) | Mel Filter, Sinc Filter | ECAPA-TDNN, AASIST | BCE | LA: 1.53 (2) | — |
| DSVAE: Interpretable disentangled representation for synthetic speech detection (Paper) | Spectrogram | VAE | KL Divergence Loss, BCE | LA: 6.56 (5) | — |
| End-to-end dual-branch network towards synthetic speech detection (Paper, Code) | LFCC, CQT | Dual-Branch Network | Classification Loss, Fake Type Classification Loss | LA: 0.80 (1) | LA: 0.021 (1) |
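
The common recipe behind these systems is a shared encoder with a binary bona fide/spoof head plus one or more auxiliary heads (e.g., fake-type classification in the dual-branch row), trained with a weighted sum of losses. A toy PyTorch sketch of that pattern; layer sizes, head definitions, and the auxiliary weight are illustrative, not any specific paper's design:

```python
import torch
import torch.nn as nn

class MultiTaskDetector(nn.Module):
    """Shared encoder with a binary head and an auxiliary fake-type head."""

    def __init__(self, feat_dim=60, n_fake_types=6, aux_weight=0.5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                     nn.Linear(128, 128), nn.ReLU())
        self.binary_head = nn.Linear(128, 2)            # bona fide vs. spoof
        self.type_head = nn.Linear(128, n_fake_types)   # which generator made it
        self.aux_weight = aux_weight
        self.ce = nn.CrossEntropyLoss()

    def forward(self, feats, y_binary, y_type):
        h = self.encoder(feats)
        # Weighted sum of the main and auxiliary cross-entropy losses.
        return (self.ce(self.binary_head(h), y_binary)
                + self.aux_weight * self.ce(self.type_head(h), y_type))

model = MultiTaskDetector()
loss = model(torch.randn(8, 60), torch.randint(0, 2, (8,)), torch.randint(0, 6, (8,)))
```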