MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs

Boyuan Chen; Minghao Shao; Abdul Basit; Siddharth Garg; Muhammad Shafique

doi:10.1609/aaai.v40i44.41058

Authors

Boyuan Chen (New York University Abu Dhabi; New York University Tandon School of Engineering)
Minghao Shao (New York University Abu Dhabi; New York University Tandon School of Engineering)
Abdul Basit (New York University Abu Dhabi)
Siddharth Garg (New York University Tandon School of Engineering)
Muhammad Shafique (New York University Abu Dhabi)

DOI: https://doi.org/10.1609/aaai.v40i44.41058

Abstract

Large language models (LLMs) remain persistently vulnerable to jailbreak attacks despite their increasing capabilities. While developers deploy alignment fine-tuning and safety guardrails, researchers consistently devise novel attacks that circumvent these defenses. This dynamic mirrors a strategic game of continual evolution. However, two challenges hinder jailbreak development: the high cost of querying top-tier LLMs and the short lifespan of effective attacks due to frequent safety updates. Together, these factors limit both cost-efficiency and impact. To address these challenges, we propose MetaCipher, a low-cost, multi-agent jailbreak framework that generalizes across LLMs with varying safety measures. MetaCipher uses reinforcement learning and is modular and adaptive, so it can be extended to future strategies. Within as few as 10 queries, MetaCipher achieves state-of-the-art attack success rates on recent malicious prompt benchmarks, outperforming prior jailbreak methods. We conduct a large-scale empirical evaluation across diverse victim models, demonstrating its robustness and adaptability.
