DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt
DOI:
https://doi.org/10.1609/aaai.v40i44.41149
Abstract
Large Vision-Language Models (LVLMs) have achieved impressive progress across various applications but remain vulnerable to malicious queries. Existing safety alignment approaches typically fail to resist malicious queries while effectively preserving utility on benign ones. To address these challenges, we propose DAVSP, which is built upon two key innovations. First, we introduce the Visual Safety Prompt, which appends a trainable padding region around the input image. It preserves visual features and expands the optimization space. Second, we propose Deep Alignment, a novel approach to train the visual safety prompt through supervision in the model's activation space. It enhances the inherent ability of LVLMs to perceive malicious queries, achieving deeper alignment than prior works. Extensive experiments demonstrate that DAVSP effectively resists malicious queries while preserving utility on benign inputs. Furthermore, DAVSP exhibits strong cross-model generalization ability. Ablation studies further reveal that both the Visual Safety Prompt and Deep Alignment are essential to the overall effectiveness.
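The padding idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `apply_visual_safety_prompt` and `border_mask`, the `pad` parameter, and the array shapes are assumptions made for illustration, and NumPy stands in for whatever framework the paper uses. The sketch only shows how a padded canvas can carry prompt values in its border while leaving the original image pixels unchanged; training the border (for example via supervision in activation space) is not shown.

```python
import numpy as np


def apply_visual_safety_prompt(image, prompt, pad):
    """Place `image` at the centre of a padded canvas whose border comes from `prompt`.

    image:  array of shape (H, W, C)
    prompt: array of shape (H + 2*pad, W + 2*pad, C); only its border region
            is visible in the result (hypothetical layout, assumed here)
    """
    h, w, c = image.shape
    expected = (h + 2 * pad, w + 2 * pad, c)
    if prompt.shape != expected:
        raise ValueError(f"prompt shape {prompt.shape} != expected {expected}")
    canvas = prompt.copy()
    # Overwrite the centre with the image so its pixels are kept as-is.
    canvas[pad:pad + h, pad:pad + w, :] = image
    return canvas


def border_mask(height, width, pad):
    """Boolean mask that is True on the padding region and False on the image."""
    mask = np.ones((height + 2 * pad, width + 2 * pad), dtype=bool)
    mask[pad:pad + height, pad:pad + width] = False
    return mask


rng = np.random.default_rng(0)
image = rng.random((4, 6, 3))      # toy 4x6 RGB image
prompt = np.zeros((8, 10, 3))      # toy prompt canvas, pad = 2 on each side
padded = apply_visual_safety_prompt(image, prompt, pad=2)
mask = border_mask(4, 6, pad=2)
```

In a training setup, only the entries selected by `mask` would be treated as learnable, which is one way to read the claim that the padding "preserves visual features and expands the optimization space".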
Published
2026-03-14
How to Cite
Zhang, Y., Li, J., Cai, L., & Li, G. (2026). DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt. Proceedings of the AAAI Conference on Artificial Intelligence, 40(44), 38111–38119. https://doi.org/10.1609/aaai.v40i44.41149
Section
AAAI Special Track on AI Alignment