DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt

Authors

  • Yitong Zhang (College of AI, Tsinghua University; School of Computer Science and Engineering, Beihang University)
  • Jia Li (College of AI, Tsinghua University)
  • Liyi Cai (School of Computer Science, Peking University)
  • Ge Li (School of Computer Science, Peking University)

DOI:

https://doi.org/10.1609/aaai.v40i44.41149

Abstract

Large Vision-Language Models (LVLMs) have achieved impressive progress across various applications but remain vulnerable to malicious queries. Existing safety alignment approaches typically fail to resist malicious queries effectively while preserving utility on benign ones. To address these challenges, we propose DAVSP, which is built upon two key innovations. First, we introduce the Visual Safety Prompt, which appends a trainable padding region around the input image. This preserves visual features and expands the optimization space. Second, we propose Deep Alignment, a novel approach that trains the visual safety prompt through supervision in the model's activation space. It enhances the inherent ability of LVLMs to perceive malicious queries, achieving deeper alignment than prior works. Extensive experiments demonstrate that DAVSP effectively resists malicious queries while preserving utility on benign inputs. Furthermore, DAVSP exhibits strong cross-model generalization ability. Ablation studies further reveal that both the Visual Safety Prompt and Deep Alignment are essential to the overall effectiveness.
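
The padding idea described in the abstract can be illustrated with a minimal sketch. It assumes images are NumPy arrays of shape (height, width, channels); the function names, the `pad` parameter, and the array layout are hypothetical choices made for illustration and are not taken from the authors' implementation. The sketch covers only how a padded canvas might be assembled, not how the prompt is trained.

```python
import numpy as np


def apply_visual_safety_prompt(image, prompt, pad):
    """Place `image` in the center of a larger canvas taken from `prompt`.

    Assumption for illustration: `prompt` holds the trainable values for the
    whole padded canvas; only its border remains visible after the paste.
    """
    h, w, c = image.shape
    expected = (h + 2 * pad, w + 2 * pad, c)
    if prompt.shape != expected:
        raise ValueError(f"prompt shape {prompt.shape} != expected {expected}")
    canvas = prompt.copy()
    # The original pixels are copied unchanged, so visual features are kept.
    canvas[pad:pad + h, pad:pad + w, :] = image
    return canvas


def border_mask(h, w, pad):
    """Boolean mask that is True on the padding region and False on the image."""
    mask = np.ones((h + 2 * pad, w + 2 * pad), dtype=bool)
    mask[pad:pad + h, pad:pad + w] = False
    return mask


if __name__ == "__main__":
    image = np.zeros((4, 4, 3))    # stand-in for an input image
    prompt = np.ones((8, 8, 3))    # stand-in for trainable padding values
    padded = apply_visual_safety_prompt(image, prompt, pad=2)
    print(padded.shape)
```

In a training loop, a mask such as the one returned by `border_mask` could restrict updates to the padding region so that the image pixels are never modified; whether the authors do this in the same way is not stated in the abstract.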

Published

2026-03-14

How to Cite

Zhang, Y., Li, J., Cai, L., & Li, G. (2026). DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt. Proceedings of the AAAI Conference on Artificial Intelligence, 40(44), 38111–38119. https://doi.org/10.1609/aaai.v40i44.41149

Section

AAAI Special Track on AI Alignment