Manipulating the Mind’s Eye: A-SAGE, the Attention-Based Attack on ViT Explainability

Authors

  • Boshi Zheng Beijing Institute of Technology
  • Yan Li Beijing Institute of Technology
  • Jiabin Liu Beijing Institute of Technology

DOI:

https://doi.org/10.1609/aaai.v40i16.38340

Abstract

The rise of Vision Transformers (ViTs) as cornerstone models in safety-critical applications like autonomous driving and medical diagnosis has shifted the focus from pure accuracy to verifiable trustworthiness. However, the very mechanisms used to explain these models (their internal attention maps) are themselves vulnerable. This creates a critical "trust gap," as the model's apparent reasoning can be maliciously manipulated. To systematically investigate this vulnerability, we introduce A-SAGE (Attention-based Steering Adversarial Generation by Corrupting Explanations), a dual-objective attack framework that forces a model to misclassify an input while simultaneously corrupting its internal attention patterns to generate a misleading explanation. A-SAGE achieves this by optimizing a unified loss that combines a standard classification objective with two explanation-specific terms: an attention entropy loss to diffuse the model's focus and an attention map distortion loss to steer the corrupted explanation towards a desired target. Our primary finding is A-SAGE's exceptional black-box transferability. Using CaiT-S as a white-box surrogate, adversarial examples generated with imperceptible perturbations achieve attack success rates of 79.4% on ViT-B, 49.7% on ResNet-50, and over 81.5% on other transformers (DeiT-B, TNT-S). Crucially, these successful attacks do not merely destroy the explanation; they generate a coherent but false attention map that deceptively "justifies" the wrong prediction. These results reveal a systemic vulnerability in the core reasoning of modern foundation models, establishing A-SAGE as a critical benchmark for auditing the robustness of AI explainability.
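The paper itself does not spell out the loss formulation beyond the abstract, but the unified objective it describes can be sketched as follows. This is an illustrative approximation only: the function names, the weighting scheme, and the mean-squared distortion term are all assumptions, not the authors' implementation.

```python
import numpy as np

def attention_entropy(attn):
    # Shannon entropy of an attention distribution;
    # higher entropy means a more diffuse (less focused) attention map.
    p = attn / attn.sum()
    return -np.sum(p * np.log(p + 1e-12))

def asage_loss(cls_loss, attn, target_attn, lam_ent=0.1, lam_dist=1.0):
    # Hypothetical unified objective in the spirit of the abstract:
    # a classification term plus two explanation-specific terms.
    # Entropy is negated so that minimizing the total loss *increases*
    # entropy, diffusing the model's focus; the distortion term pulls
    # the attention map toward an attacker-chosen target explanation.
    # lam_ent and lam_dist are made-up trade-off weights.
    ent_term = -attention_entropy(attn)
    p = attn / attn.sum()
    q = target_attn / target_attn.sum()
    dist_term = np.mean((p - q) ** 2)
    return cls_loss + lam_ent * ent_term + lam_dist * dist_term
```

In an actual attack loop, `cls_loss` would be a misclassification objective on the perturbed image and `attn` the surrogate model's attention map, with the perturbation updated by gradient descent on this combined loss.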

Published

2026-03-14

How to Cite

Zheng, B., Li, Y., & Liu, J. (2026). Manipulating the Mind’s Eye: A-SAGE, the Attention-Based Attack on ViT Explainability. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13369–13377. https://doi.org/10.1609/aaai.v40i16.38340

Section

AAAI Technical Track on Computer Vision XIII