Time Shuffle: A Transferability-Booster for Multiple Audio Adversarial Tasks

JiaCheng Deng; Dengpan Ye; Yuhong Liu; Zhaolin Wei; Ziyi Liu; Haoran Duan

doi:10.1609/aaai.v40i42.40849

Authors

JiaCheng Deng Wuhan University, School of Cyber Science and Engineering, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, Wuhan, 430072, China
Dengpan Ye Wuhan University, School of Cyber Science and Engineering, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, Wuhan, 430072, China Cyberspace Institute of Advanced Technology, Guangzhou University, Guangdong, 510006, China
Yuhong Liu Wuhan University, School of Cyber Science and Engineering, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, Wuhan, 430072, China
Zhaolin Wei Wuhan University, School of Cyber Science and Engineering, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, Wuhan, 430072, China
Ziyi Liu Wuhan University, School of Cyber Science and Engineering, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, Wuhan, 430072, China
Haoran Duan Wuhan University, School of Cyber Science and Engineering, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, Wuhan, 430072, China

DOI:

https://doi.org/10.1609/aaai.v40i42.40849

Abstract

Existing audio adversarial attack methods suffer from poor transferability, primarily due to insufficient exploration of model decision mechanisms and overreliance on heuristic-driven algorithm design. This paper aims to alleviate this gap. Specifically, through observations across three mainstream audio tasks (Automatic Speech Recognition, Speaker Verification, and Keyword Spotting), we reveal that these models primarily rely on local temporal features—inputs with time shuffled retain 83.7% of original accuracy. The SHAP-based visualization further validated that time shuffle leads to a significant shift in the salient regions of the model, but the samples can still be correctly identified, indicating the presence of redundant features that can affect decision-making. Inspired by these findings, we propose Time-Shuffle (TS) adversarial attack (including segments-based TS and phoneme-level-based TS-p). This method divides audio or phonemes into segments, randomly shuffles them, and computes gradients on the shuffled structure. By forcing perturbations to exploit transferable local temporal features and reduce overfitting to source-specific patterns, TS/TS-p inherently enhances transferability. As a model-agnostic framework, TS/TS-p can seamlessly integrate with existing attack methods. Comprehensive experiments demonstrate that TS-p achieved SOTA and boosts transferability by about 23%/14.7%/6.3% on ASR/ASV/KWS.

Time Shuffle: A Transferability-Booster for Multiple Audio Adversarial Tasks

Authors

DOI:

Abstract

Downloads

Published

How to Cite

Issue

Section

Information