> [!IMPORTANT]
>
> A curated collection of papers and resources on Audio Deepfake Detection (ADD).
>
> Please refer to our survey [**"Research progress on speech deepfake and its detection techniques"**](https://www.cjig.cn/en/article/doi/10.11834/jig.230476/?viewType=HTML) for the detailed contents.
>
> Please let us know if you discover any mistakes or have suggestions by emailing us: xuyuxiong2022@email.szu.edu.cn

# Table of contents

- [Table of contents](#table-of-contents)
- [What's New](#whats-new)
- [Survey](#survey)
- [Top Repositories](#top-repositories)
- [Audio Large Model](#audio-large-model)
- [Datasets](#datasets)
- [Audio Preprocessing](#audio-preprocessing)
  - [Commonly Used Noise Datasets](#commonly-used-noise-datasets)
  - [Audio Enhancement Methods](#audio-enhancement-methods)
- [Feature Extraction](#feature-extraction)
  - [Handcrafted Feature-based Forgery Detection](#handcrafted-feature-based-forgery-detection)
  - [Hybrid Feature-based Forgery Detection](#hybrid-feature-based-forgery-detection)
  - [End-to-end Forgery Detection](#end-to-end-forgery-detection)
  - [Feature Fusion-based Forgery Detection](#feature-fusion-based-forgery-detection)
- [Network Training](#network-training)
  - [Multi-task Learning-based Forgery Detection](#multi-task-learning-based-forgery-detection)
- [Reference](#reference)
- [Statement](#statement)

# What's New
- [Update Jun. 3, 2025] 🔥🔥🔥 Updated [Datasets](#datasets), [Survey](#survey), and [Top Repositories](#top-repositories).
- [Update Feb. 20, 2025] Updated [Datasets](#datasets); added [Survey](#survey) and [Top Repositories](#top-repositories).

# Survey
[**⬆ Back to top**](#table-of-contents)
- `【2022-04】-【ADD-Survey】-【Ben Gurion University】`
  - **A study on data augmentation in voice anti-spoofing**
  - **Author(s):** Ariel Cohen, Inbal Rimon, Eran Aflalo, and Haim H. Permuter
  - [Paper](https://doi.org/10.1016/j.specom.2022.04.005) [Code](https://github.com/InbalRim/A-Study-On-Data-Augmentation-In-Voice-Anti-Spoofing)
- `【2022-05】-【ADD-Survey】-【King Saud University】`
  - **A review of modern audio deepfake detection methods: challenges and future directions**
  - **Author(s):** Zaynab Almutairi and Hebah Elgibreen
  - [Paper](https://doi.org/10.3390/a15050155)
- `【2023-01】-【ADD-Survey】-【University of Maryland Baltimore County】`
  - **Audio deepfakes: A survey**
  - **Author(s):** Zahra Khanjani, Gabrielle Watson, and Vandana P. Janeja
  - [Paper](https://doi.org/10.3389/fdata.2022.1001063)
- `【2023-02】-【ADD-Survey】-【Oakland University】`
  - **Battling voice spoofing: a review, comparative analysis, and generalizability evaluation of state-of-the-art voice spoofing counter measures**
  - **Author(s):** Awais Khan, Khalid Mahmood Malik, James Ryan, and Mikul Saravanan
  - [Paper](https://doi.org/10.1007/s10462-023-10539-8) [Code](https://github.com/smileslab/Comparative-Analysis-Voice-Spoofing)
- `【2023-04】-【ADD-Survey】-【Panjab University】`
  - **Review of audio deepfake detection techniques: Issues and prospects**
  - **Author(s):** Abhishek Dixit, Nirmal Kaur, and Staffy Kingra
  - [Paper](https://doi.org/10.1111/exsy.13322)
- `【2023-07】-【ADD-Survey】-【Indian Institute of Technology】`
  - **Uncovering the deceptions: an analysis on audio spoofing detection and future prospects**
  - **Author(s):** Rishabh Ranjan, Mayank Vatsa, and Richa Singh
  - [Paper](https://doi.org/10.24963/ijcai.2023/756)
- `【2023-08】-【ADD-Survey】-【CASIA】`
  - **Audio deepfake detection: a survey**
  - **Author(s):** Jiangyan Yi, Chenglong Wang, Jianhua Tao, Xiaohui Zhang, Chu Yuan Zhang, and Yan Zhao
  - [Paper](https://arxiv.org/abs/2308.14970)
- `【2024-02】-【Multimodal-Survey】-【Purdue University, Nanchang University】`
  - **Detecting Multimedia Generated by Large AI Models: A Survey**
  - **Author(s):** Li Lin, Neeraj Gupta, Yue Zhang, Hainan Ren, Chun-Hao Liu, Feng Ding, Xin Wang, Xin Li, Luisa Verdoliva, and Shu Hu
  - [Paper](https://arxiv.org/abs/2402.00045) [Code](https://github.com/Purdue-M2/Detect-LAIM-generated-Multimedia-Survey)
- `【2024-04】-【ADD-Survey】-【Toronto Metropolitan University】`
  - **A Survey on Speech Deepfake Detection**
  - **Author(s):** Menglu Li, Yasaman Ahmadiadli, and Xiao-Ping Zhang
  - [Paper](https://doi.org/10.1145/3714458)
- `【2024-05】-【Multimodal-Survey】-【Cool Large Language Models Research Group, Renmin University of China】`
  - **Fake Artificial Intelligence Generated Contents (FAIGC): A Survey of Theories, Detection Methods, and Opportunities**
  - **Author(s):** Xiaomin Yu, Yezhaohui Wang, Yanfang Chen, Zhen Tao, Dinghao Xi, Shichao Song, and Simin Niu
  - [Paper](https://arxiv.org/abs/2405.00711)
- `【2024-09】-【ADD-Survey】-【Austrian Institute of Technology】`
  - **A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection**
  - **Author(s):** Lam Pham, Phat Lam, Tin Nguyen, Hieu Tang, Huyen Nguyen, Alexander Schindler, and Hai Canh Vu
  - [Paper](https://doi.org/10.1016/j.cosrev.2025.100757) [Code](https://github.com/AI-ResearchGroup/A-Comprehensive-Survey-with-Critical-Analysis-for-Deepfake-Speech-Detection)
- `【2024-11】-【Multimodal-Survey】-【University of Bucharest】`
  - **Deepfake Media Generation and Detection in the Generative AI Era: A Survey and Outlook**
  - **Author(s):** Florinel-Alin Croitoru, Andrei-Iulian Hîji, Vlad Hondru, Nicolae-Cătălin Ristea, Paul Irofti, Marius Popescu, Cristian Rusu, Radu Tudor Ionescu, Fahad Shahbaz Khan, and Mubarak Shah
  - [Paper](https://arxiv.org/abs/2411.19537) [Code](https://github.com/CroitoruAlin/biodeep)
- `【2024-11】-【Multimodal-Survey】-【University College Dublin】`
  - **Passive Deepfake Detection Across Multi-modalities: A Comprehensive Survey**
  - **Author(s):** Hong-Hanh Nguyen-Le, Van-Tuan Tran, Dinh-Thuc Nguyen, and Nhien-An Le-Khac
  - [Paper](https://arxiv.org/abs/2411.17911)
- `【2024-12】-【ADD-Survey】-【Imperial College London, Technical University of Munich】`
  - **From Audio Deepfake Detection to AI-Generated Music Detection: A Pathway and Overview**
  - **Author(s):** Yupei Li, Manuel Milling, Lucia Specia, and Björn W. Schuller
  - [Paper](https://arxiv.org/abs/2412.00571)
- `【2025-02】-【Multimodal-Survey】-【BUPT, University of California, CASIA】`
  - **Survey on AI-Generated Media Detection: From Non-MLLM to MLLM**
  - **Author(s):** Yueying Zou, Peipei Li, Zekun Li, Huaibo Huang, Xing Cui, Xuannan Liu, Chenghanyu Zhang, and Ran He
  - [Paper](https://arxiv.org/abs/2502.05240)

96 |
97 | # Top Repositories
98 | [**⬆ Back to top**](#table-of-contents)
99 | - **Audio Large Language Models**
100 | - Resources on Audio Large Language Models, including datasets, methods, benchmarks, and studies.
101 | - [Repo](https://github.com/AudioLLMs/Awesome-Audio-LLM)
102 | - **awesome-fake-audio-detection**
103 | - A list of tools, papers and code related to Fake Audio Detection.
104 | - [Repo](https://github.com/john852517791/awesome-fake-audio-detection?tab=readme-ov-file)
105 | - **ASVspoof Challenge Official Repository**
106 | - Sub-repositories include [ASVspoof 5](https://github.com/asvspoof-challenge/asvspoof5), [ASVspoof 2021](https://github.com/asvspoof-challenge/2021), and [classifier-adjacency](https://github.com/asvspoof-challenge/classifier-adjacency).
107 | - [Repo](https://github.com/asvspoof-challenge)
108 |
# Audio Large Model
[**⬆ Back to top**](#table-of-contents)
| Model | Publisher | Year | Achievable Tasks |
|:----:|:-------------:|:--------------:|:--------------------|
| AudioLM<br>[Paper](https://arxiv.org/pdf/2209.03143) [Website](https://google-research.github.io/seanet/audiolm/examples/) [Code](https://github.com/lucidrains/audiolm-pytorch) | Google | 2022.09 | 1. Maps input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space.<br>2. Speech continuation, acoustic generation, unconditional generation, generation without semantic tokens, and piano continuation. |
| VALL-E<br>[Paper](https://arxiv.org/abs/2301.02111) [Website](https://www.microsoft.com/en-us/research/project/vall-e-x/) | Microsoft | 2023.01 | 1. Synthesizes high-quality personalized speech from only a 3-second recording of an unseen speaker.<br>2. VALL-E X: cross-lingual speech synthesis. |
| USM<br>[Website](https://sites.research.google/usm/) | Google | 2023.03 | 1. ASR for over 100 languages.<br>2. Downstream ASR tasks.<br>3. Automatic Speech Translation (AST). |
| SpeechGPT<br>[Website](https://github.com/0nutation/SpeechGPT) | Fudan University | 2023.05 | 1. Perceives and generates multi-modal content.<br>2. A spoken-dialogue LLM that follows multi-modal human instructions. |
| Pengi<br>[Paper](https://arxiv.org/pdf/2305.11834.pdf) [Website](https://github.com/microsoft/Pengi) | Microsoft | 2023.05 | 1. An audio language model that leverages transfer learning by framing all audio tasks as text-generation tasks.<br>2. Its unified architecture enables both open-ended and close-ended tasks without additional fine-tuning or task-specific extensions. |
| VoiceBox<br>[Website](https://voicebox.metademolab.com/) | Meta | 2023.06 | 1. Synthesizes speech across six languages.<br>2. Removes transient noise.<br>3. Edits content.<br>4. Transfers audio style within and across languages.<br>5. Generates diverse speech samples. |
| AudioPaLM<br>[Paper](https://arxiv.org/pdf/2306.12925) [Website](https://google-research.github.io/seanet/audiopalm/examples) | Google | 2023.06 | 1. Speech-to-speech translation.<br>2. Automatic Speech Recognition (ASR). |

# Datasets
[**⬆ Back to top**](#table-of-contents)
| Attack Types | Year | Dataset | Number of Audio<br>(Subdataset: Real/Fake) | Language |
|:-----------:|:------------:|:------------:|:-------------:|:------------:|
|TTS|2019|FoR<br>[Paper](https://ieeexplore.ieee.org/document/8906599) [Dataset](http://bil.eecs.yorku.ca/datasets)|111,000/87,000|English|
|TTS|2021|WaveFake<br>[Paper](https://arxiv.org/abs/2111.02813) [Dataset](https://zenodo.org/record/5642694)|16,283/117,985|English, Japanese|
|TTS|2021|HAD (Half-truth)<br>[Paper](https://arxiv.org/abs/2104.03617)|53,612/107,224|Chinese|
|TTS|2021|FakeAVCeleb<br>[Paper](https://arxiv.org/pdf/2108.05080) [Dataset](https://github.com/DASH-Lab/FakeAVCeleb/tree/main)|10,209/11,335|English|
|TTS|2022|ADD 2022<br>[Paper](https://arxiv.org/pdf/2202.08433v2.pdf)|LF: 5,619/46,067<br>PF: 5,319/46,419<br>FG-D: 5,319/46,206|Chinese|
|TTS|2022|CMFD<br>[Paper](https://ieeexplore.ieee.org/document/9921343) [Dataset](https://github.com/WuQinfang/CMFD)|Chinese: 1,800/1,000<br>English: 1,800/1,000|English, Chinese|
|TTS|2022|In-the-Wild<br>[Paper](https://arxiv.org/abs/2203.16263) [Dataset](https://deepfake-demo.aisec.fraunhofer.de/in_the_wild)|19,963/11,816|English|
|TTS|2022|CFAD<br>[Paper](https://arxiv.org/abs/2207.12308) [Dataset](https://zenodo.org/record/8122764)|38,600/77,200|Chinese|
|TTS|2022|Psynd<br>[Paper](https://ieeexplore.ieee.org/document/9956134) [Dataset](https://scholarbank.nus.edu.sg/handle/10635/227398)|2,294|English|
|TTS|2022|TIMIT-TTS<br>[Paper](https://arxiv.org/abs/2209.08000) [Dataset](https://zenodo.org/records/6560159)|0/79,120|English|
|TTS|2023|ODSS<br>[Paper](https://ieeexplore.ieee.org/abstract/document/10374863) [Dataset](https://zenodo.org/records/8370669)|11,032/18,993|English, German, Spanish|
|TTS|2024|MLAAD<br>[Paper](https://arxiv.org/abs/2401.09512) [Dataset](https://deepfake-total.com/mlaad)|-/76,000|Multi-lingual|
|TTS|2024|CD-ADD<br>[Paper](https://arxiv.org/abs/2404.04904)|300 hours|English|
|TTS|2024|DiffSSD<br>[Paper](https://arxiv.org/abs/2409.13049) [Dataset](https://huggingface.co/datasets/purdueviperlab/diffssd)|24,226/70,000|English|
|TTS|2024|ACCENT<br>[Paper](https://arxiv.org/abs/2409.08346)|53,651/192,461|Multi-lingual|
|TTS|2024|SpoofCeleb<br>[Paper](https://arxiv.org/abs/2409.17285) [Dataset](http://www.jungjee.com/spoofceleb/)|250k+|English|
|TTS|2024|LlamaPartialSpoof<br>[Paper](https://arxiv.org/abs/2409.14743) [Dataset](https://github.com/hieuthi/LlamaPartialSpoof)|10,573/33,479|English|
|TTS|2024|FakeSound<br>[Paper](https://arxiv.org/abs/2406.08052) [Dataset](https://github.com/FakeSoundData/FakeSound)|-/3,798|English|
|TTS|2024|DFADD<br>[Paper](https://arxiv.org/abs/2409.08731) [Dataset](https://github.com/isjwdu/DFADD)|44,455/163,500|English|
|TTS|2024|SONAR<br>[Paper](https://arxiv.org/abs/2410.04324) [Dataset](https://github.com/Jessegator/SONAR)|-/2,274|Multi-lingual|
|TTS|2025|6KSFx<br>[Paper](https://arxiv.org/abs/2501.17198) [Dataset](https://github.com/nellyngz95/6KSFX)|-/6,000|-|
|TTS|2025|BanglaFake<br>[Paper](https://arxiv.org/html/2505.10885) [Dataset](https://huggingface.co/datasets/sifat1221/banglaFake)|12,260/13,260|Bengali|
|Replay|2017|ASVspoof 2017<br>[Paper](https://www.asvspoof.org/asvspoof2017overview_cameraReady.pdf) [Dataset](https://datashare.ed.ac.uk/handle/10283/3055)|3,565/14,465|English|
|Replay|2019|ReMASC<br>[Paper](https://arxiv.org/abs/1904.03365) [Dataset](https://github.com/YuanGongND/ReMASC)|9,240/45,472|English, Chinese, Hindi|
|Replay|2019|VSDC<br>[Paper](https://arxiv.org/abs/1909.00935) [Dataset](http://www.secs.oakland.edu/~mahmood/datasets/audiospoof)|1,687/11,772|English|
|Replay|2024|POLIPHONE<br>[Paper](https://arxiv.org/abs/2410.06221) [Dataset](https://zenodo.org/records/13903412)|41,941|English|
|TTS and VC|2015|AVspoof<br>[Paper](https://ieeexplore.ieee.org/document/7358783) [Dataset](https://zenodo.org/record/4081040)|LA: 15,504/120,480<br>PA: 15,504/14,465|English|
|TTS and VC|2015|ASVspoof 2015<br>[Paper](https://datashare.ed.ac.uk/bitstream/handle/10283/853/ASVspoof2015_evaluation_plan.pdf?sequence=2&isAllowed=y) [Dataset](http://dx.doi.org/10.7488/ds/298)|16,651/246,500|English|
|TTS and VC|2021|FMFCC-A<br>[Paper](https://arxiv.org/abs/2110.09441) [Dataset](https://github.com/Amforever/FMFCC-A)|10,000/40,000|Chinese|
|TTS and VC|2022|SceneFake<br>[Paper](https://arxiv.org/abs/2211.06073) [Dataset](https://zenodo.org/record/7663324#.Y_XKMuPYuUk)|19,838/64,642|English|
|TTS and VC|2022|EmoFake<br>[Paper](http://arxiv.org/abs/2211.05363)|35,000/53,200|English, Chinese|
|TTS and VC|2023|PartialSpoof<br>[Paper](https://arxiv.org/abs/2104.02518) [Dataset](https://zenodo.org/record/4817532#.YLd8Yi2l1hF)|12,483/108,978|English|
|TTS and VC|2023|ADD 2023<br>[Paper](https://arxiv.org/abs/2305.13774)|FG-D: 172,819/113,042<br>RL: 55,468/65,449<br>AR: 14,907/95,383|Chinese|
|TTS and VC|2023|DECRO<br>[Paper](https://dl.acm.org/doi/10.1145/3543507.3583222) [Dataset](https://github.com/petrichorwq/DECRO-dataset)|Chinese: 21,218/41,880<br>English: 12,484/42,799|English, Chinese|
|TTS and VC|2023|HABLA<br>[Paper](https://doi.org/10.21437/Interspeech.2023-2272) [Dataset](https://github.com/Ruframapi/HABLA)|22,000/58,000|Spanish|
|TTS and VC|2024|DeepFakeVox-HQ<br>[Paper](https://ieeexplore.ieee.org/document/10800535)|693k/643k|English|
|TTS and VC|2024|VoiceWukong<br>[Paper](https://arxiv.org/abs/2409.06348) [Dataset](https://voicewukong.github.io)|5,300/413,400|English, Chinese|
|TTS and VC|2024|Speech-Forensics<br>[Paper](https://www.ijcai.org/proceedings/2024/0046) [Dataset](https://github.com/ring-zl/Speech-Forensics)|13,100/7,452|English|
|TTS and VC|2024|VoiceEdit<br>[Paper](https://arxiv.org/abs/2402.06304)|-|Multi-lingual|
|TTS and VC|2024|RFP<br>[Paper](https://arxiv.org/abs/2404.17721) [Dataset](https://zenodo.org/records/10202142)|28,115/74,199|English|
|TTS and VC|2025|MADD<br>[Paper](https://ieeexplore.ieee.org/document/10800535)|60,000/129,990|Multi-lingual|
|TTS and VC|2025|XMAD-Bench<br>[Paper](https://arxiv.org/pdf/2506.00462) [Dataset](https://github.com/ristea/xmad-bench/)|414,858|Multi-lingual|
|TTS and Vocoder|2024|Diffuse or Confuse<br>[Paper](https://arxiv.org/abs/2410.06796) [Dataset](https://github.com/AntonFirc/diffusion-deepfake-speech-dataset/)|131,000/183,400|English|
|TTS and Vocoder|2025|ShiftySpeech<br>[Paper](https://arxiv.org/abs/2502.05674) [Dataset](https://huggingface.co/datasets/ash56/ShiftySpeech)|3,000+ hours|English, Chinese, Japanese|
|TTS, VC and Replay|2019|ASVspoof 2019<br>[Paper](https://arxiv.org/abs/1904.05441) [Dataset](https://datashare.ed.ac.uk/handle/10283/3336)|LA: 12,483/108,978<br>PA: 28,890/189,540|English|
|TTS, VC and Replay|2021|ASVspoof 2021<br>[Paper](https://arxiv.org/abs/2109.00535) [Dataset](https://www.asvspoof.org/index2021.html)|LA: 18,452/163,114<br>PA: 126,630/816,480<br>PF: 14,869/519,059|English|
|TTS, VC and Vocoder|2024|SpeechFake<br>[Paper](https://openreview.net/forum?id=GpUO6qYNQG)|3,000+ hours|English|
|VC, Replay and Adversarial|2024|VSASV<br>[Paper](https://www.isca-archive.org/interspeech_2024/hoang24b_interspeech.html) [Dataset](https://github.com/hustep-lab/VSASV-Dataset)|164,000/174,000|Multi-lingual|
|TTS, VC and Adversarial|2024|ASVspoof 5<br>[Paper](https://arxiv.org/abs/2502.08857) [Dataset](https://zenodo.org/records/14498691)|188,819/815,262|English|
|Voice Cloning|2021|RTVCSpoof<br>[Paper](https://ojs.aaai.org/index.php/AAAI/article/view/6044/5900)|3,284/4,843|English|
|Voice Cloning|2024|Kratika Dataset<br>[Paper](https://doi.org/10.1145/3658664.3659658)|24,226/25,000|English|
|Vocoder|2022|Yan Dataset<br>[Paper](https://doi.org/10.1145/3552466.3556525)|8,200/63,200|Chinese|
|Vocoder|2023|LibriSeVoc<br>[Paper](https://arxiv.org/abs/2304.13085) [Dataset](https://github.com/csun22/SyntheticVoice-Detection-Vocoder-Artifacts)|13,201/79,206|English|
|Vocoder|2023|Voc.v1-v4<br>[Paper](https://arxiv.org/abs/2210.10570) [Dataset](https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/09-asvspoof-vocoded-trn)|2,580/10,320|English|
|Vocoder|2024|MLADDC<br>[Paper](https://openreview.net/forum?id=ic3HvoOTeU) [Dataset](https://speech007.github.io/MLADDC_Nips/)|80k/160k|Multi-lingual|
|Vocoder|2024|CVoiceFake<br>[Paper](https://dl.acm.org/doi/10.1145/3658644.3670285) [Dataset](https://safeearweb.github.io/Project/)|23,544/91,700|Multi-lingual|
|Impersonation|2024|IPAD<br>[Paper](https://arxiv.org/abs/2408.17009)|5,170/18,874|Chinese|
|Text-To-Music (TTM)|2024|FSD<br>[Paper](https://arxiv.org/abs/2309.02232) [Dataset](https://github.com/xieyuankun/FSD-Dataset)|200/500|Chinese|
|Text-To-Music (TTM)|2024|SingFake<br>[Paper](https://arxiv.org/abs/2309.07525) [Dataset](https://github.com/yongyizang/SingFake)|634/671|Multi-lingual|
|Text-To-Music (TTM)|2024|CtrSVDD<br>[Paper](https://arxiv.org/abs/2406.02438) [Dataset](https://github.com/SVDDChallenge/)|32,312/188,486|Multi-lingual|
|Text-To-Music (TTM)|2024|FakeMusicCaps<br>[Paper](https://arxiv.org/abs/2409.10684) [Dataset](https://zenodo.org/records/13732524)|5.5k/27,605|English|
|Text-To-Music (TTM)|2025|SONICS<br>[Paper](https://arxiv.org/abs/2408.14080)|48,090/49,074|English|
|Text-To-Music (TTM)|2025|SingNet<br>[Paper](https://arxiv.org/pdf/2505.09325) [Dataset](https://singnet-dataset.github.io/)|2,963.4 hours|Multi-lingual|
|Codec-based Speech Generation (CoSG)|2024|CodecFake<br>[Paper](https://arxiv.org/abs/2406.07237) [Dataset](https://codecfake.github.io/)|42,752/45,045|English|
|Codec-based Speech Generation (CoSG)|2024|ALM-ADD<br>[Paper](https://arxiv.org/abs/2408.10853) [Dataset](https://github.com/xieyuankun/ALM-ADD)|123/210|English|
|Codec-based Speech Generation (CoSG)|2024|Codecfake<br>[Paper](https://ieeexplore.ieee.org/abstract/document/10830534) [Dataset](https://zenodo.org/records/13838106)|132,277/925,939|English, Chinese|
|Codec-based Speech Generation (CoSG)|2025|ST-Codecfake<br>[Paper](https://arxiv.org/pdf/2501.06514) [Dataset](https://zenodo.org/records/14631091)|13,228/145,778|English, Chinese|
|Codec-based Speech Generation (CoSG)|2025|CodecFake+<br>[Paper](https://arxiv.org/abs/2501.08238)|90,163/1,423,894|English|

# Audio Preprocessing

## Commonly Used Noise Datasets
[**⬆ Back to top**](#table-of-contents)
| Dataset | Description |
|:----:|:-------------|
| MUSAN<br>[Dataset](https://www.openslr.org/17/) | A corpus of music, speech, and noise. |
| RIR<br>[Dataset](https://www.openslr.org/28/) | A database of simulated and real room impulse responses, plus isotropic and point-source noises. All audio files are sampled at 16 kHz with 16-bit precision. |
| NOIZEUS<br>[Dataset](https://ecs.utdallas.edu/loizou/speech/noizeus/) | Contains 30 IEEE sentences (produced by three male and three female speakers) corrupted by eight real-world noises at different SNRs: suburban train, babble, car, exhibition hall, restaurant, street, airport, and train station. |
| NoiseX-92<br>[Dataset](http://spib.linse.ufsc.br/noise.html) | Fifteen noise types, each 235 seconds long, recorded at a 19.98 kHz sampling rate through a 16-bit A/D converter with an anti-aliasing filter and no pre-emphasis stage. |
| DEMAND<br>[Dataset](https://zenodo.org/record/1227121#.YXtsyPlBxjU) | A multi-channel acoustic noise database covering diverse environments. |
| ESC-50<br>[Dataset](https://github.com/karolpiczak/ESC-50) | A labeled collection of 2,000 environmental recordings from Freesound.org clips, suitable for environmental sound classification: 5-second recordings organized into 5 broad categories, each with 10 classes (40 examples per class). |
| ESC<br>[Dataset](http://shujujishi.com/dataset/69b2bf03-d855-4f8b-ab96-1ec80e285863.html) | Includes ESC-50, ESC-10, and ESC-US. |
| FSD50K<br>[Dataset](https://zenodo.org/record/4060432#.Y1kvcHZByUk) | An open dataset of human-labeled sound events: 51,197 Freesound clips totalling 108.3 hours of multi-label audio, unequally distributed across 200 classes from the AudioSet Ontology. |

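In ADD pipelines these corpora are typically mixed into bona fide speech at a chosen signal-to-noise ratio. A minimal NumPy sketch of SNR-controlled mixing; the function name and the dummy signals are illustrative, not tied to any specific dataset loader:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `speech` at a target signal-to-noise ratio (dB)."""
    # Loop the noise if it is shorter than the speech, then trim to length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]

    # Choose a gain so that 10*log10(P_speech / P_scaled_noise) == snr_db.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

# Corrupt a dummy 1-second, 16 kHz utterance at 15 dB SNR.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(8000)
noisy = mix_at_snr(speech, noise, snr_db=15.0)
```
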
## Audio Enhancement Methods
[**⬆ Back to top**](#table-of-contents)
| Method | Description |
|:----:|:-------------|
| SpecAugment<br>[Paper](https://arxiv.org/pdf/1904.08779.pdf) [Code](https://github.com/DemisEom/SpecAugment) | Enhancement strategies include time warping, frequency masking, and time masking. |
| WavAugment<br>[Paper](https://arxiv.org/abs/2007.00991) [Code](https://github.com/facebookresearch/WavAugment) | Enhancement strategies include pitch randomization, reverberation, additive noise, time dropout (temporal masking), band reject, and clipping. |
| RawBoost<br>[Paper](https://arxiv.org/abs/2111.04433) [Code](https://github.com/TakHemlata/RawBoost-antispoofing) | Enhancement strategies include linear and non-linear convolutive noise, impulsive signal-dependent additive noise, and stationary signal-independent additive noise. |

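For a concrete reference, a minimal NumPy sketch of the frequency- and time-masking steps used by SpecAugment-style augmentation; the mask widths are illustrative, and the original method additionally applies time warping:

```python
import numpy as np

def spec_augment(spec, freq_mask=8, time_mask=20, rng=None):
    """Apply one frequency mask and one time mask to a (freq, time) spectrogram."""
    rng = rng if rng is not None else np.random.default_rng()
    spec = spec.copy()
    n_freq, n_time = spec.shape

    # Frequency masking: zero a random band of consecutive frequency bins.
    f = int(rng.integers(0, freq_mask + 1))
    f0 = int(rng.integers(0, n_freq - f + 1))
    spec[f0:f0 + f, :] = 0.0

    # Time masking: zero a random span of consecutive frames.
    t = int(rng.integers(0, time_mask + 1))
    t0 = int(rng.integers(0, n_time - t + 1))
    spec[:, t0:t0 + t] = 0.0
    return spec

# Example on a dummy 80-mel, 200-frame log-spectrogram.
masked = spec_augment(np.random.default_rng(0).standard_normal((80, 200)))
```
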
# Feature Extraction

## Handcrafted Feature-based Forgery Detection
[**⬆ Back to top**](#table-of-contents)

| Paper | Data Augmentation | Feature Extraction | Network Framework | Loss Function | EER (%) | t-DCF |
|:----|:----:|:----:|:----:|:----:|:----:|:----:|
| Detecting spoofing attacks using VGG and SincNet: BUT-Omilia submission to ASVspoof 2019 challenge (Paper, Code) | — | CQT, Power Spectrum | VGG, SincNet | CE | LA: 8.01 (4)<br>PA: 1.51 (2) | LA: 0.208 (4)<br>PA: 0.037 (1) |
| Long-term high frequency features for synthetic speech detection (Paper) | Cafe, White and Street Noise | ICQC, ICQCC, ICBC, ICLBC | DNN | CE | LA: 7.78 (3) | LA: 0.187 (3) |
| Voice spoofing countermeasure for logical access attacks detection (Paper) | — | ELTP-LFCC | DBiLSTM | — | LA: 0.74 (1) | LA: 0.008 (1) |
| Voice spoofing detector: A unified anti-spoofing framework (Paper) | — | ATP-GTCC | SVM | Hamming Distance | LA: 0.75 (2)<br>PA: 1.00 (1) | LA: 0.050 (2)<br>PA: 0.064 (2) |
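
Most front-ends in the table above are fixed time-frequency transforms. For illustration, a minimal librosa sketch computing a log-power CQT, one of the features these systems use; the hop length and bin counts are illustrative defaults, not any paper's exact configuration:

```python
import librosa
import numpy as np

# Load a bundled example clip and resample to 16 kHz (any wav file works here).
y, sr = librosa.load(librosa.ex("trumpet"), sr=16000)

# Constant-Q transform: geometrically spaced frequency bins give finer
# low-frequency resolution than an ordinary STFT.
C = librosa.cqt(y, sr=sr, hop_length=256, n_bins=84, bins_per_octave=12)

# Log-power CQT, the usual input representation for CQT-based detectors.
log_cqt = librosa.amplitude_to_db(np.abs(C), ref=np.max)  # shape: (84, n_frames)
```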

## Hybrid Feature-based Forgery Detection
[**⬆ Back to top**](#table-of-contents)

| Paper | Data Augmentation | Feature Extraction | Network Structure | Loss Function | EER (%) | t-DCF |
|:----|:----:|:----:|:----:|:----:|:----:|:----:|
| Light convolutional neural network with feature genuinization for detection of synthetic speech attacks (Paper) | — | CQT-based LPS | LCNN | — | LA: 4.07 (11) | LA: 0.102 (10) |
| Siamese convolutional neural network using gaussian probability feature for spoofing speech detection (Paper) | — | LFCC | Siamese CNN | CE | LA: 3.79 (10)<br>PA: 7.98 (5) | LA: 0.093 (5)<br>PA: 0.195 (2) |
| Generalization of audio deepfake detection (Paper) | RIR and MUSAN | LFB | ResNet18 | LCML | LA: 1.81 (4) | LA: 0.052 (4) |
| Continual learning for fake audio detection (Paper) | — | LFCC | LCNN, DFWF | Similarity Loss | LA: 7.74 (15)<br>PA: 8.85 (6) | — |
| Partially-connected differentiable architecture search for deepfake and spoofing detection (Paper, Code) | Frequency Mask | LFCC | PC-DARTS | WCE | LA: 4.96 (12) | LA: 0.091 (8) |
| One-class learning towards synthetic voice spoofing detection (Paper, Code) | — | LFCC | ResNet18 | OC-Softmax | LA: 2.19 (7) | LA: 0.059 (5) |
| Replay and synthetic speech detection with res2net architecture (Paper, Code) | — | CQT | SE-Res2Net50 | BCE | LA: 2.50 (8)<br>PA: 0.46 (2) | LA: 0.074 (7)<br>PA: 0.012 (2) |
| An empirical study on channel effects for synthetic voice spoofing countermeasure systems (Paper, Code) | Telephone Codecs, Device/Room Impulse Responses (IRs) | LFCC | LCNN, ResNet-OC | OC-Softmax, CE | LA: 3.92 (10) | — |
| Efficient attention branch network with combined loss function for automatic speaker verification spoof detection (Paper, Code) | SpecAug, Attention Mask | LFCC | EfficientNet-A0, SE-Res2Net50 | WCE, Triplet Loss | LA: 1.89 (6)<br>PA: 0.86 (4) | LA: 0.507 (11)<br>PA: 0.024 (4) |
| ResMax: Detecting voice spoofing attacks with residual network and max feature map (Paper) | — | CQT | ResMax | BCE | LA: 2.19 (7)<br>PA: 0.37 (1) | LA: 0.060 (6)<br>PA: 0.009 (1) |
| Synthetic voice detection and audio splicing detection using SE-Res2Net-Conformer architecture (Paper) | Additive noise at an SNR of 15 dB or 25 dB | CQT | SE-Res2Net34-Conformer | CE | LA: 1.85 (5) | LA: 0.060 (6) |
| Long-term variable Q transform: A novel time-frequency transform algorithm for synthetic speech detection (Paper) | — | L-VQT | L-DenseNet | NLLLoss | LA: 1.54 (3) | LA: 0.045 (3) |
| Learning from yourself: A self-distillation method for fake speech detection (Paper) | — | LPS, F0 | ECANet, SENet | A-Softmax | LA: 1.00 (2)<br>PA: 0.65 (3) | LA: 0.031 (2)<br>PA: 0.017 (3) |
| How to boost anti-spoofing with x-vectors (Paper) | — | LFCC, MFCC | TDNN, SENet34 | LCML | LA: 0.83 (1) | LA: 0.024 (1) |
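
A recurring ingredient above is one-class training (OC-Softmax in the one-class learning and channel-effects rows): bona fide speech is compacted around a target direction while spoofed speech is pushed away. A minimal PyTorch sketch following the published formulation as we understand it; the embedding size, margins, and scale are commonly cited defaults, not a reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OCSoftmax(nn.Module):
    """One-class softmax: pull bona fide embeddings toward a learned target
    direction and push spoofed embeddings away from it."""

    def __init__(self, feat_dim=256, m_real=0.9, m_fake=0.2, alpha=20.0):
        super().__init__()
        self.center = nn.Parameter(torch.randn(1, feat_dim))
        self.m_real, self.m_fake, self.alpha = m_real, m_fake, alpha

    def forward(self, emb, labels):
        # Cosine similarity between each embedding and the target direction.
        scores = (F.normalize(emb, dim=1) @ F.normalize(self.center, dim=1).t()).squeeze(1)
        # labels: 0 = bona fide (want score > m_real), 1 = spoof (want score < m_fake).
        margins = torch.where(labels == 0, self.m_real - scores, scores - self.m_fake)
        # softplus(alpha * margin) == log(1 + exp(alpha * margin)).
        return F.softplus(self.alpha * margins).mean()

loss = OCSoftmax()(torch.randn(4, 256), torch.tensor([0, 1, 0, 1]))
```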

## End-to-end Forgery Detection
[**⬆ Back to top**](#table-of-contents)

| Paper | Data Augmentation | Feature Extraction | Network Structure | Loss Function | EER (%) | t-DCF |
|:----|:----:|:----:|:----:|:----:|:----:|:----:|
| A light convolutional GRU-RNN deep feature extractor for ASV spoofing detection (Paper) | — | LC-GRNN | PLDA | — | LA: 6.28 (13)<br>PA: 2.23 | LA: 0.152 (10)<br>PA: 0.061 |
| RW-ResNet: A novel speech anti-spoofing model using raw waveform (Paper) | — | 1D Convolution Residual Block | ResNet | CE | LA: 2.98 (11) | LA: 0.082 (9) |
| Raw differentiable architecture search for speech deepfake and spoofing detection (Paper, Code) | Masking Filter | Sinc Filter | PC-DARTS | P2SGrad | LA: 1.77 (10) | LA: 0.052 (7) |
| Towards end-to-end synthetic speech detection (Paper, Code) | — | DNN | Res-TSSDNet, Inc-TSSDNet | WCE | LA: 1.64 (9) | LA: 0.048 (6) |
| End-to-end anti-spoofing with RawNet2 (Paper, Code) | — | Sinc Filter | RawNet2 | CE | LA: 1.12 (5) | LA: 0.033 (3) |
| FastAudio: A learnable audio front-end for spoof speech detection (Paper, Code) | — | FastAudio filter | X-vector, ECAPA-TDNN | — | LA: 1.54 (7) | LA: 0.045 (5) |
| Fully automated end-to-end fake audio detection (Paper) | Sinc Filter | Wav2Vec2 | light-DARTS | Comparative loss | LA: 1.08 (4) | — |
| Audio anti-spoofing using a simple attention module and joint optimization based on additive angular margin loss and meta-learning (Paper) | — | Sinc Filter | RawNet2, SimAM | AAM Softmax, MSE | LA: 0.99 (3) | LA: 0.029 (2) |
| AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks (Paper, Code) | — | Sinc Filter | RawNet2, MGO, HS-GAL | CE | LA: 0.83 (2) | LA: 0.028 (1) |
| AI-synthesized voice detection using neural vocoder artifacts (Paper, Code) | Resampling, Noise Addition | Sinc Filter | RawNet2 | CE, Softmax | LA: 4.54 (12) | — |
| To-RawNet: Improving RawNet with TCN and orthogonal regularization for fake audio detection (Paper) | RawBoost | Sinc Filter | RawNet2, TCN | CE, Orthogonal Loss | LA: 1.58 (8) | — |
| Speaker-Aware Anti-spoofing (Paper) | — | Sinc Filter | AASIST, M2S Converter | CE | LA: 1.13 (6) | LA: 0.038 (4) |
| Spoofing attacker also benefits from self-supervised pretrained model (Paper) | — | HuBERT, WavLM | Residual block, Conv-TasNet | AAM softmax | LA: 0.44 (1) | — |
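
Many of the end-to-end systems above (RawNet2, AASIST, Raw PC-DARTS) start from a bank of band-pass sinc filters applied directly to the raw waveform. In SincNet-style front-ends the cutoff frequencies are learned parameters; the minimal NumPy sketch below fixes them for illustration, with band edges and kernel size chosen arbitrarily:

```python
import numpy as np

def sinc_bandpass(f_lo, f_hi, kernel_size, sr):
    """Windowed band-pass FIR kernel: difference of two ideal low-pass sincs."""
    t = np.arange(kernel_size) - (kernel_size - 1) / 2
    lowpass = lambda fc: 2 * fc / sr * np.sinc(2 * fc * t / sr)
    return (lowpass(f_hi) - lowpass(f_lo)) * np.hamming(kernel_size)

# A tiny fixed filter bank convolved with raw audio (SincNet learns the cutoffs).
sr, ksize = 16000, 251
bands = [(50, 300), (300, 800), (800, 2000), (2000, 5000)]  # Hz, illustrative
wav = np.random.default_rng(0).standard_normal(sr)          # 1 s of dummy audio
feats = np.stack([np.convolve(wav, sinc_bandpass(lo, hi, ksize, sr), mode="same")
                  for lo, hi in bands])                      # (n_filters, n_samples)
```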

## Feature Fusion-based Forgery Detection
[**⬆ Back to top**](#table-of-contents)

| Paper | Feature Extraction | Network Structure | Loss Function | EER (%) |
|:----|:----:|:----:|:----:|:----:|
| Voice spoofing countermeasure for synthetic speech detection (Paper) | GTCC, MFCC, Spectral Flux, Spectral Centroid | Bi-LSTM | — | LA: 3.05 (4) |
| Combining automatic speaker verification and prosody analysis for synthetic speech detection (Paper) | MFCC, Mel-Spectrogram | ECAPA-TDNN, Prosody Encoder | BCE | LA: 5.39 (5) |
| Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation (Paper) | Sinc Filter, Wav2Vec2 | AASIST | Contrastive Loss, WCE | — |
| Overlapped frequency-distributed network: Frequency-aware voice spoofing countermeasure (Paper) | Mel-Spectrogram, CQT | LCNN, ResNet | — | LA: 1.35 (2)<br>PA: 0.35 |
| Detection of cross-dataset fake audio based on prosodic and pronunciation features (Paper) | Phoneme Feature, Prosody Feature, Wav2Vec2 | LCNN, Bi-LSTM | CTC | LA: 1.58 (3) |
| Betray oneself: A novel audio deepfake detection model via mono-to-stereo conversion (Paper, Code) | Sinc Filter | AASIST, M2S Converter | CE | LA: 1.34 (1) |
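
In their simplest form, fusion systems concatenate embeddings from complementary front-ends before a shared classifier. A toy PyTorch sketch of that pattern; all dimensions and branch names are illustrative:

```python
import torch
import torch.nn as nn

# Toy pooled utterance-level embeddings from two different front-end branches.
mfcc_emb = torch.randn(8, 40)     # e.g., output of an MFCC branch
prosody_emb = torch.randn(8, 16)  # e.g., output of a prosody branch

# Simplest fusion: concatenate, then classify bona fide vs. spoof.
fused = torch.cat([mfcc_emb, prosody_emb], dim=1)  # (8, 56)
logits = nn.Linear(56, 2)(fused)
```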

# Network Training

## Multi-task Learning-based Forgery Detection
[**⬆ Back to top**](#table-of-contents)

| Paper | Feature Extraction | Network Structure | Loss Function | EER (%) | t-DCF |
|:----|:----:|:----:|:----:|:----:|:----:|
| Multi-task learning in utterance-level and segmental-level spoof detection (Paper) | LFCC | SELCNN, Bi-LSTM | P2SGrad | — | — |
| SA-SASV: An end-to-end spoof-aggregated spoofing-aware speaker verification system (Paper, Code) | Fbanks, Sinc Filter | ECAPA-TDNN, ARawNet | BCE, AAM Softmax, CE | LA: 4.86 (4) | — |
| STATNet: Spectral and temporal features based multi-task network for audio spoofing detection (Paper) | Sinc Filter | RawNet2, TCM, SCM | CE | LA: 2.45 (3) | LA: 0.062 (2) |
| A probabilistic fusion framework for spoofing aware speaker verification (Paper, Code) | Mel Filter, Sinc Filter | ECAPA-TDNN, AASIST | BCE | LA: 1.53 (2) | — |
| DSVAE: Interpretable disentangled representation for synthetic speech detection (Paper) | Spectrogram | VAE | KL Divergence Loss, BCE | LA: 6.56 (5) | — |
| End-to-end dual-branch network towards synthetic speech detection (Paper, Code) | LFCC, CQT | Dual-Branch Network | Classification Loss, Fake Type Classification Loss | LA: 0.80 (1) | LA: 0.021 (1) |
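
The common recipe behind these systems is a shared encoder with a binary bona fide/spoof head plus one or more auxiliary heads (e.g., fake-type classification in the dual-branch row), trained with a weighted sum of losses. A toy PyTorch sketch of that pattern; layer sizes, head definitions, and the auxiliary weight are illustrative, not any specific paper's design:

```python
import torch
import torch.nn as nn

class MultiTaskDetector(nn.Module):
    """Shared encoder with a binary head and an auxiliary fake-type head."""

    def __init__(self, feat_dim=60, n_fake_types=6, aux_weight=0.5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                     nn.Linear(128, 128), nn.ReLU())
        self.binary_head = nn.Linear(128, 2)            # bona fide vs. spoof
        self.type_head = nn.Linear(128, n_fake_types)   # which generator made it
        self.aux_weight = aux_weight
        self.ce = nn.CrossEntropyLoss()

    def forward(self, feats, y_binary, y_type):
        h = self.encoder(feats)
        # Weighted sum of the main and auxiliary cross-entropy losses.
        return (self.ce(self.binary_head(h), y_binary)
                + self.aux_weight * self.ce(self.type_head(h), y_type))

model = MultiTaskDetector()
loss = model(torch.randn(8, 60), torch.randint(0, 2, (8,)), torch.randint(0, 6, (8,)))
```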