Manipulating the Mind’s Eye: A-SAGE, the Attention-Based Attack on ViT Explainability

Authors

  • Boshi Zheng Beijing Institute of Technology
  • Yan Li Beijing Institute of Technology
  • Jiabin Liu Beijing Institute of Technology

DOI:

https://doi.org/10.1609/aaai.v40i16.38340

Abstract

The rise of Vision Transformers (ViTs) as cornerstone models in safety-critical applications like autonomous driving and medical diagnosis has shifted the focus from pure accuracy to verifiable trustworthiness. However, the very mechanisms used to explain these models (their internal attention maps) are themselves vulnerable. This creates a critical "trust gap," as the model's apparent reasoning can be maliciously manipulated. To systematically investigate this vulnerability, we introduce A-SAGE (Attention-based Steering Adversarial Generation by Corrupting Explanations), a dual-objective attack framework that forces a model to misclassify an input while simultaneously corrupting its internal attention patterns to generate a misleading explanation. A-SAGE achieves this by optimizing a unified loss that combines a standard classification objective with two explanation-specific terms: an attention entropy loss to diffuse the model's focus and an attention map distortion loss to steer the corrupted explanation towards a desired target. Our primary finding is A-SAGE's exceptional black-box transferability. Using CaiT-S as a white-box surrogate, adversarial examples generated with imperceptible perturbations achieve attack success rates of 79.4% on ViT-B, 49.7% on ResNet-50, and over 81.5% on other transformers (DeiT-B, TNT-S). Crucially, these successful attacks do not merely destroy the explanation; they generate a coherent but false attention map that deceptively "justifies" the wrong prediction. These results reveal a systemic vulnerability in the core reasoning of modern foundation models, establishing A-SAGE as a critical benchmark for auditing the robustness of AI explainability.
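The paper itself does not spell out the loss formulation beyond the abstract, but the unified objective it describes can be sketched as follows. This is an illustrative approximation only: the function names, the weighting scheme, and the mean-squared distortion term are all assumptions, not the authors' implementation.

```python
import numpy as np

def attention_entropy(attn):
    # Shannon entropy of an attention distribution;
    # higher entropy means a more diffuse (less focused) attention map.
    p = attn / attn.sum()
    return -np.sum(p * np.log(p + 1e-12))

def asage_loss(cls_loss, attn, target_attn, lam_ent=0.1, lam_dist=1.0):
    # Hypothetical unified objective in the spirit of the abstract:
    # a classification term plus two explanation-specific terms.
    # Entropy is negated so that minimizing the total loss *increases*
    # entropy, diffusing the model's focus; the distortion term pulls
    # the attention map toward an attacker-chosen target explanation.
    # lam_ent and lam_dist are made-up trade-off weights.
    ent_term = -attention_entropy(attn)
    p = attn / attn.sum()
    q = target_attn / target_attn.sum()
    dist_term = np.mean((p - q) ** 2)
    return cls_loss + lam_ent * ent_term + lam_dist * dist_term
```

In an actual attack loop, `cls_loss` would be a misclassification objective on the perturbed image and `attn` the surrogate model's attention map, with the perturbation updated by gradient descent on this combined loss.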

Published

2026-03-14

How to Cite

Zheng, B., Li, Y., & Liu, J. (2026). Manipulating the Mind’s Eye: A-SAGE, the Attention-Based Attack on ViT Explainability. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13369–13377. https://doi.org/10.1609/aaai.v40i16.38340

Section

AAAI Technical Track on Computer Vision XIII