Efficient LLM-Jailbreaking via Multimodal-LLM Jailbreak

Authors

  • Haoxuan Ji, Xi'an Jiaotong University
  • Zheng Lin, Xidian University
  • Zhenxing Niu, Xidian University
  • Xinbo Gao, Xidian University
  • Gang Hua, Amazon.com

DOI:

https://doi.org/10.1609/aaai.v40i42.40863

Abstract

This paper focuses on jailbreaking attacks against large language models (LLMs), which elicit objectionable content in response to harmful user queries. Unlike previous LLM-jailbreak methods that target the LLM directly, our approach begins by constructing a multimodal large language model (MLLM) built upon the target LLM. We then perform an efficient MLLM jailbreak to obtain a jailbreaking embedding. Finally, we convert the embedding into a textual jailbreaking suffix and use it to jailbreak the target LLM. Compared to direct LLM-jailbreak methods, our indirect approach is more efficient, as MLLMs are more vulnerable to jailbreaking than pure LLMs. Additionally, to improve the attack success rate, we propose an image-text semantic matching scheme to identify a suitable initial input. Extensive experiments demonstrate that our approach surpasses current state-of-the-art jailbreak methods in both efficiency and effectiveness. Moreover, our approach exhibits superior cross-class generalization.

Published

2026-03-14

How to Cite

Ji, H., Lin, Z., Niu, Z., Gao, X., & Hua, G. (2026). Efficient LLM-Jailbreaking via Multimodal-LLM Jailbreak. Proceedings of the AAAI Conference on Artificial Intelligence, 40(42), 35527–35535. https://doi.org/10.1609/aaai.v40i42.40863

Section

AAAI Technical Track on Philosophy and Ethics of AI