Activation Manipulation Attack: Penetrating and Harmful Jailbreak Attack Against Large Vision-Language Models

Authors

  • Haojie Hao — State Key Laboratory of Complex & Critical Software Environment, Beihang University
  • Jiakai Wang — Zhongguancun Laboratory
  • Aishan Liu — State Key Laboratory of Complex & Critical Software Environment, Beihang University
  • Yuqing Ma — Institute of Artificial Intelligence, Beihang University
  • Haotong Qin — Department of Information Technology and Electrical Engineering, ETH Zurich
  • Yuanfang Guo — State Key Laboratory of Complex & Critical Software Environment, Beihang University
  • Xianglong Liu — State Key Laboratory of Complex & Critical Software Environment, Beihang University; Zhongguancun Laboratory; Institute of Dataspace, Hefei, China

DOI:

https://doi.org/10.1609/aaai.v40i42.40858

Abstract

Recently, Large Vision-Language Models (LVLMs) have been shown to be vulnerable to jailbreak attacks, highlighting the urgent need for further research to comprehensively identify and mitigate these threats. Unfortunately, existing jailbreak studies focus primarily on coarse-grained input manipulation to elicit specific responses, overlooking the exploitation of internal representations, i.e., intermediate activations, which constrains their ability to penetrate alignment safeguards and generate harmful responses. To tackle this issue, we propose the Activation Manipulation (ActMan) Attack framework, which performs fine-grained activation manipulations inspired by the perception and cognition stages of human decision-making, enhancing both the penetration capability and the harmfulness of attacks. To improve penetration capability, we introduce a Deceptive Visual Camouflage module inspired by the masking effect in human perception. This module uses a benign activation-guided attention redirection strategy to conceal abnormal activation patterns, thereby suppressing the LVLM's defense detection during early-stage decoding. To enhance harmfulness, we design a Malicious Semantic Induction module drawing on the framing effect in human cognition, which reconstructs jailbreak instructions using malicious activation guidance to alter the LVLM's risk assessment during late-stage decoding, thereby amplifying the harmfulness of model responses. Extensive experiments on six mainstream LVLMs demonstrate that our method remarkably outperforms state-of-the-art baselines, achieving an average relative ASR improvement of 12.06%.

Published

2026-03-14

How to Cite

Hao, H., Wang, J., Liu, A., Ma, Y., Qin, H., Guo, Y., & Liu, X. (2026). Activation Manipulation Attack: Penetrating and Harmful Jailbreak Attack Against Large Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(42), 35481–35489. https://doi.org/10.1609/aaai.v40i42.40858

Issue

Section

AAAI Technical Track on Philosophy and Ethics of AI