TIMA: Text-Image Mutual Awareness for Balancing Zero-Shot Adversarial Robustness and Generalization Ability

Fengji Ma; Hei Victor Cheng; Chenxing Li; Li Liu

doi:10.1609/aaai.v40i29.39603

Authors

Fengji Ma The Hong Kong University of Science and Technology (Guangzhou)
Hei Victor Cheng Aarhus University
Chenxing Li Tencent
Li Liu The Hong Kong University of Science and Technology (Guangzhou)

DOI:

https://doi.org/10.1609/aaai.v40i29.39603

Abstract

Achieving zero-shot adversarial robustness without sacrificing generalization remains challenging for foundation models such as CLIP, especially under large adversarial perturbations. Through empirical analyses, we identify three critical yet overlooked issues: (1) Logit margins exhibit a stable offset between small and large adversarial perturbations, suggesting that explicitly adjusting margins could improve robustness against unseen large perturbations. (2) A significant negative correlation exists between logit margin and inter-class semantic similarity, indicating that semantic structures are insufficiently leveraged by existing methods. (3) Existing methods for adjusting text embeddings disrupt the intrinsic semantic consistency established by pre-trained models, undermining generalization capability. Motivated by these findings, we propose a novel Text-Image Mutual Awareness (TIMA) framework, including a Text-Aware Image (TAI) tuning module with an Adaptive Semantic-Aware Margin (ASAM) to explicitly calibrate logit margins, and an Image-Aware Text (IAT) tuning module with Semantic Consistent Minimum Hyperspherical Energy (SC-MHE) to preserve semantic consistency. Comprehensive experiments validate that TIMA significantly outperforms existing approaches by effectively addressing the identified limitations.

TIMA: Text-Image Mutual Awareness for Balancing Zero-Shot Adversarial Robustness and Generalization Ability

Authors

DOI:

Abstract

Downloads

Published

How to Cite

Issue

Section

Information