Reason2Attack: Jailbreaking Text-to-Image Models via LLM Reasoning

Authors

  • Chenyu Zhang School of New Media and Communication, Tianjin University, Tianjin, China
  • Lanjun Wang School of New Media and Communication, Tianjin University, Tianjin, China
  • Yiwen Ma School of Electrical and Information Engineering, Tianjin University, Tianjin, China
  • Wenhui Li School of Electrical and Information Engineering, Tianjin University, Tianjin, China
  • Guoqing Jin State Key Laboratory of Communication Content Cognition, People's Daily Online, Beijing, China
  • Anan Liu School of Electrical and Information Engineering, Tianjin University, Tianjin, China

DOI:

https://doi.org/10.1609/aaai.v40i42.40919

Abstract

Text-to-Image (T2I) models typically deploy safety mechanisms to prevent the generation of sensitive images. Unfortunately, recent jailbreaking attack methods manually design instructions that guide an LLM to generate adversarial prompts, effectively exposing the safety vulnerabilities of T2I models. However, existing methods have two limitations: 1) they rely on manually crafted, exhaustive strategies for designing adversarial prompts and lack a unified framework, and 2) they require numerous queries to achieve a successful attack, limiting their practical applicability. To address these issues, we propose Reason2Attack (R2A), which aims to enhance the effectiveness and efficiency of the LLM in jailbreaking attacks. Specifically, we first use Frame Semantics theory to systematize existing manually crafted strategies and propose a unified generation framework that produces chain-of-thought (CoT) adversarial prompts step by step. Following this, we propose a two-stage LLM reasoning training framework guided by the attack process. In the first stage, the LLM is fine-tuned on CoT examples generated by the unified generation framework to internalize the adversarial prompt generation process grounded in Frame Semantics. In the second stage, we incorporate the jailbreaking task into the LLM's reinforcement learning process, guided by the proposed attack-process reward function that balances prompt stealthiness, effectiveness, and length, enabling the LLM to understand T2I models and their safety mechanisms. Extensive experiments on various T2I models with safety mechanisms, as well as commercial T2I models, demonstrate the superiority and practicality of R2A.
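A reward that balances the three factors named in the abstract could be sketched as follows. This is a minimal illustration only: the weights, component scores, and function names are assumptions for exposition, not the paper's actual attack-process reward formulation.

```python
# Hypothetical sketch of a reward balancing the three factors the abstract
# names: prompt stealthiness, attack effectiveness, and prompt length.
# Weights and scoring rules are illustrative assumptions, not R2A's actual
# reward function.

def attack_reward(passed_filter: bool,
                  image_matches_target: bool,
                  prompt_len: int,
                  max_len: int = 77,
                  w_stealth: float = 0.3,
                  w_effect: float = 0.6,
                  w_length: float = 0.1) -> float:
    # Stealthiness: the adversarial prompt evades the safety mechanism.
    stealth = 1.0 if passed_filter else 0.0
    # Effectiveness: the generated image depicts the intended content
    # (only possible if the prompt got past the filter at all).
    effect = 1.0 if (passed_filter and image_matches_target) else 0.0
    # Length term: shorter prompts score higher, clamped to [0, 1].
    length = max(0.0, 1.0 - prompt_len / max_len)
    return w_stealth * stealth + w_effect * effect + w_length * length

# A successful, short prompt scores far above one blocked by the filter.
print(attack_reward(True, True, 30))
print(attack_reward(False, False, 30))
```

Such a scalar reward is the shape of signal a reinforcement-learning stage would optimize; the actual trade-off among the three terms in R2A is defined by the paper, not by these example weights.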

Published

2026-03-14

How to Cite

Zhang, C., Wang, L., Ma, Y., Li, W., Jin, G., & Liu, A. (2026). Reason2Attack: Jailbreaking Text-to-Image Models via LLM Reasoning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(42), 36030–36038. https://doi.org/10.1609/aaai.v40i42.40919

Section

AAAI Technical Track on Philosophy and Ethics of AI