Activation Manipulation Attack: Penetrating and Harmful Jailbreak Attack Against Large Vision-Language Models

Authors

  • Haojie Hao — State Key Laboratory of Complex & Critical Software Environment, Beihang University
  • Jiakai Wang — Zhongguancun Laboratory
  • Aishan Liu — State Key Laboratory of Complex & Critical Software Environment, Beihang University
  • Yuqing Ma — Institute of Artificial Intelligence, Beihang University
  • Haotong Qin — Department of Information Technology and Electrical Engineering, ETH Zurich
  • Yuanfang Guo — State Key Laboratory of Complex & Critical Software Environment, Beihang University
  • Xianglong Liu — State Key Laboratory of Complex & Critical Software Environment, Beihang University; Zhongguancun Laboratory; Institute of Dataspace, Hefei, China

DOI:

https://doi.org/10.1609/aaai.v40i42.40858

Abstract

Recently, Large Vision-Language Models (LVLMs) have been shown to be vulnerable to jailbreak attacks, highlighting the urgent need for further research to comprehensively identify and mitigate these threats. Unfortunately, existing jailbreak studies focus primarily on coarse-grained input manipulation to elicit specific responses, overlooking the exploitation of internal representations, i.e., intermediate activations, which constrains their ability to penetrate alignment safeguards and generate harmful responses. To tackle this issue, we propose the Activation Manipulation (ActMan) Attack framework, which performs fine-grained activation manipulations inspired by the perception and cognition stages of human decision-making, enhancing both the penetration capability and the harmfulness of attacks. To improve penetration capability, we introduce a Deceptive Visual Camouflage module inspired by the masking effect in human perception. This module uses a benign activation-guided attention redirection strategy to conceal abnormal activation patterns, thereby suppressing the LVLM's defense detection during early-stage decoding. To enhance harmfulness, we design a Malicious Semantic Induction module drawing on the framing effect in human cognition, which reconstructs jailbreak instructions using malicious activation guidance to alter the LVLM's risk assessment during late-stage decoding, thereby amplifying the harmfulness of model responses. Extensive experiments on six mainstream LVLMs demonstrate that our method remarkably outperforms state-of-the-art baselines, achieving an average relative ASR improvement of 12.06%.

Published

2026-03-14

How to Cite

Hao, H., Wang, J., Liu, A., Ma, Y., Qin, H., Guo, Y., & Liu, X. (2026). Activation Manipulation Attack: Penetrating and Harmful Jailbreak Attack Against Large Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(42), 35481–35489. https://doi.org/10.1609/aaai.v40i42.40858

Issue

Section

AAAI Technical Track on Philosophy and Ethics of AI