DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt

Yitong Zhang; Jia Li; Liyi Cai; Ge Li

doi:10.1609/aaai.v40i44.41149

Authors

Yitong Zhang, College of AI, Tsinghua University; School of Computer Science and Engineering, Beihang University
Jia Li, College of AI, Tsinghua University
Liyi Cai, School of Computer Science, Peking University
Ge Li, School of Computer Science, Peking University

Abstract

Large Vision-Language Models (LVLMs) have achieved impressive progress across various applications but remain vulnerable to malicious queries. Existing safety alignment approaches typically fail to resist malicious queries while effectively preserving utility on benign ones. To address these challenges, we propose DAVSP, which is built upon two key innovations. First, we introduce the Visual Safety Prompt, which appends a trainable padding region around the input image. It preserves visual features and expands the optimization space. Second, we propose Deep Alignment, a novel approach that trains the visual safety prompt through supervision in the model's activation space. It enhances the inherent ability of LVLMs to perceive malicious queries, achieving deeper alignment than prior work. Extensive experiments demonstrate that DAVSP effectively resists malicious queries while preserving utility on benign inputs. Furthermore, DAVSP exhibits strong cross-model generalization ability. Ablation studies further reveal that both the Visual Safety Prompt and Deep Alignment are essential to the overall effectiveness.
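
To make the idea of a padding region around the input image concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single-channel image stored as nested lists, and the names `apply_visual_safety_prompt`, `trainable_mask`, `prompt`, and `pad` are hypothetical. It shows only the placement step: the image pixels are copied unchanged into the centre of a larger canvas, and only the surrounding border would be open to optimization. How the border values are trained (the Deep Alignment step) is not shown.

```python
from typing import List

# Assumption for brevity: a grayscale image as H x W nested lists.
# A real LVLM pipeline would use RGB tensors instead.
Image = List[List[float]]


def apply_visual_safety_prompt(image: Image, prompt: Image, pad: int) -> Image:
    """Embed `image` in the centre of a copy of `prompt`.

    `prompt` is a hypothetical (H + 2*pad) x (W + 2*pad) canvas. Its interior
    is overwritten by the image, so only its border of width `pad` survives.
    """
    h, w = len(image), len(image[0])
    if len(prompt) != h + 2 * pad or any(len(row) != w + 2 * pad for row in prompt):
        raise ValueError("prompt must have shape (H + 2*pad) x (W + 2*pad)")
    out = [row[:] for row in prompt]  # copy, so the prompt itself is not mutated
    for i in range(h):
        out[i + pad][pad:pad + w] = image[i]  # image pixels are left unmodified
    return out


def trainable_mask(h: int, w: int, pad: int) -> List[List[int]]:
    """Return 1 for border (trainable) positions and 0 for image positions."""
    return [
        [0 if pad <= r < pad + h and pad <= c < pad + w else 1
         for c in range(w + 2 * pad)]
        for r in range(h + 2 * pad)
    ]


image = [[1.0, 2.0], [3.0, 4.0]]
prompt = [[0.5] * 4 for _ in range(4)]  # placeholder values; these would be learned
padded = apply_visual_safety_prompt(image, prompt, pad=1)
mask = trainable_mask(2, 2, pad=1)
```

In this sketch a gradient-based optimizer would update only the positions where `mask` is 1, which is one way to read the abstract's claim that the original visual features are preserved while the optimization space grows with the padding width.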
