Detect All-Type Deepfake Audio: Wavelet Prompt Tuning for Enhanced Auditory Perception

Authors

  • Yuankun Xie, State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China
  • Ruibo Fu, Institute of Automation, Chinese Academy of Sciences, Beijing, China
  • Xiaopeng Wang, School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
  • Zhiyong Wang, School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
  • Songjun Cao, YouTu Lab, Tencent, Beijing, China
  • Long Ma, YouTu Lab, Tencent, Beijing, China
  • Haonan Cheng, State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China
  • Long Ye, State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China

DOI:

https://doi.org/10.1609/aaai.v40i42.40907

Abstract

The rapid advancement of audio generation technologies has escalated the risks of malicious deepfake audio across speech, sound, singing voice, and music, threatening multimedia security and trust. While existing countermeasures (CMs) perform well in single-type audio deepfake detection (ADD), their performance declines in cross-type scenarios. This paper studies the all-type ADD task. We are the first to comprehensively establish an all-type ADD benchmark for evaluating current CMs, incorporating cross-type deepfake detection across speech, sound, singing voice, and music. We then introduce the prompt tuning self-supervised learning (PT-SSL) training paradigm, which optimizes the SSL front-end by learning specialized prompt tokens for ADD and requires 458× fewer trainable parameters than fine-tuning (FT). Considering the auditory perception of different audio types, we propose the wavelet prompt tuning (WPT)-SSL method, which captures type-invariant auditory deepfake information from the frequency domain without requiring additional training parameters, thereby enhancing performance over FT in the all-type ADD task. To obtain a universal CM, we utilize all types of deepfake audio for co-training. Experimental results demonstrate that WPT-XLSR-AASIST achieves the best performance, with an average EER of 3.58% across all evaluation sets.
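
The abstract reports results as an equal error rate (EER), the operating point at which the false acceptance rate on deepfake audio equals the false rejection rate on bona fide audio. The following is a minimal sketch of that metric only, not the authors' evaluation code: the function name `equal_error_rate`, the convention that a higher score means "more likely bona fide", and the toy scores are all assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumption: a higher score means the countermeasure considers the
    clip more likely to be bona fide (genuine).
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False acceptance: a spoofed clip scores at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection: a bona fide clip scores below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (hypothetical, not taken from the paper).
bonafide = [0.9, 0.8, 0.7, 0.3]
spoof = [0.1, 0.2, 0.4, 0.6]
print(f"EER = {equal_error_rate(bonafide, spoof):.2%}")
```

An average EER over several evaluation sets, as quoted in the abstract, would then be the mean of this value computed per set. Toolkits used in practice typically interpolate between thresholds rather than picking the nearest observed score, so values on small sets can differ slightly from this sketch.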

Published

2026-03-14

How to Cite

Xie, Y., Fu, R., Wang, X., Wang, Z., Cao, S., Ma, L., … Ye, L. (2026). Detect All-Type Deepfake Audio: Wavelet Prompt Tuning for Enhanced Auditory Perception. Proceedings of the AAAI Conference on Artificial Intelligence, 40(42), 35922–35930. https://doi.org/10.1609/aaai.v40i42.40907

Section

AAAI Technical Track on Philosophy and Ethics of AI