Reason2Attack: Jailbreaking Text-to-Image Models via LLM Reasoning

Authors

  • Chenyu Zhang School of New Media and Communication, Tianjin University, Tianjin, China
  • Lanjun Wang School of New Media and Communication, Tianjin University, Tianjin, China
  • Yiwen Ma School of Electrical and Information Engineering, Tianjin University, Tianjin, China
  • Wenhui Li School of Electrical and Information Engineering, Tianjin University, Tianjin, China
  • Guoqing Jin State Key Laboratory of Communication Content Cognition, People's Daily Online, Beijing, China
  • Anan Liu School of Electrical and Information Engineering, Tianjin University, Tianjin, China

DOI:

https://doi.org/10.1609/aaai.v40i42.40919

Abstract

Text-to-Image (T2I) models typically deploy safety mechanisms to prevent the generation of sensitive images. Unfortunately, recent jailbreaking attack methods manually design instructions that guide an LLM to generate adversarial prompts, effectively exposing the safety vulnerabilities of T2I models. However, existing methods have two limitations: 1) they rely on manually crafted, exhaustive strategies for designing adversarial prompts and lack a unified framework, and 2) they require numerous queries to achieve a successful attack, limiting their practical applicability. To address these issues, we propose Reason2Attack (R2A), which aims to enhance the effectiveness and efficiency of the LLM in jailbreaking attacks. Specifically, we first use Frame Semantics theory to systematize existing manually crafted strategies and propose a unified generation framework that produces chain-of-thought (CoT) adversarial prompts step by step. Following this, we propose a two-stage LLM reasoning training framework guided by the attack process. In the first stage, the LLM is fine-tuned on CoT examples generated by the unified generation framework to internalize the adversarial prompt generation process grounded in Frame Semantics. In the second stage, we incorporate the jailbreaking task into the LLM's reinforcement learning process, guided by the proposed attack-process reward function that balances prompt stealthiness, effectiveness, and length, enabling the LLM to understand T2I models and their safety mechanisms. Extensive experiments on various T2I models with safety mechanisms, as well as commercial T2I models, demonstrate the superiority and practicality of R2A.
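A reward that balances the three factors named in the abstract could be sketched as follows. This is a minimal illustration only: the weights, component scores, and function names are assumptions for exposition, not the paper's actual attack-process reward formulation.

```python
# Hypothetical sketch of a reward balancing the three factors the abstract
# names: prompt stealthiness, attack effectiveness, and prompt length.
# Weights and scoring rules are illustrative assumptions, not R2A's actual
# reward function.

def attack_reward(passed_filter: bool,
                  image_matches_target: bool,
                  prompt_len: int,
                  max_len: int = 77,
                  w_stealth: float = 0.3,
                  w_effect: float = 0.6,
                  w_length: float = 0.1) -> float:
    # Stealthiness: the adversarial prompt evades the safety mechanism.
    stealth = 1.0 if passed_filter else 0.0
    # Effectiveness: the generated image depicts the intended content
    # (only possible if the prompt got past the filter at all).
    effect = 1.0 if (passed_filter and image_matches_target) else 0.0
    # Length term: shorter prompts score higher, clamped to [0, 1].
    length = max(0.0, 1.0 - prompt_len / max_len)
    return w_stealth * stealth + w_effect * effect + w_length * length

# A successful, short prompt scores far above one blocked by the filter.
print(attack_reward(True, True, 30))
print(attack_reward(False, False, 30))
```

Such a scalar reward is the shape of signal a reinforcement-learning stage would optimize; the actual trade-off among the three terms in R2A is defined by the paper, not by these example weights.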

Published

2026-03-14

How to Cite

Zhang, C., Wang, L., Ma, Y., Li, W., Jin, G., & Liu, A. (2026). Reason2Attack: Jailbreaking Text-to-Image Models via LLM Reasoning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(42), 36030–36038. https://doi.org/10.1609/aaai.v40i42.40919

Section

AAAI Technical Track on Philosophy and Ethics of AI