Perception-Guided Jailbreak Against Text-to-Image Models

Authors

  • Yihao Huang, Nanyang Technological University
  • Le Liang, East China Normal University
  • Tianlin Li, Nanyang Technological University
  • Xiaojun Jia, Nanyang Technological University; Key Laboratory of Cyberspace Security, Ministry of Education, China
  • Run Wang, Wuhan University
  • Weikai Miao, East China Normal University
  • Geguang Pu, East China Normal University; Shanghai Trusted Industrial Control Platform Co., Ltd., China
  • Yang Liu, Nanyang Technological University

DOI:

https://doi.org/10.1609/aaai.v39i25.34821

Abstract

In recent years, Text-to-Image (T2I) models have garnered significant attention due to their remarkable advancements. However, security concerns have emerged due to their potential to generate inappropriate or Not-Safe-For-Work (NSFW) images. In this paper, inspired by the observation that texts with different semantics can lead to similar human perceptions, we propose an LLM-driven perception-guided jailbreak method, termed PGJ. It is a black-box jailbreak method that requires no specific T2I model (model-free) and generates highly natural attack prompts. Specifically, we propose identifying a safe phrase that is similar in human perception yet inconsistent in text semantics with the target unsafe word and using it as a substitution. The experiments conducted on six open-source models and commercial online services with thousands of prompts have verified the effectiveness of PGJ.

Published

2025-04-11

How to Cite

Huang, Y., Liang, L., Li, T., Jia, X., Wang, R., Miao, W., … Liu, Y. (2025). Perception-Guided Jailbreak Against Text-to-Image Models. Proceedings of the AAAI Conference on Artificial Intelligence, 39(25), 26238–26247. https://doi.org/10.1609/aaai.v39i25.34821

Section

AAAI Technical Track on Philosophy and Ethics of AI