MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs

Authors

  • Boyuan Chen, New York University Abu Dhabi; New York University Tandon School of Engineering
  • Minghao Shao, New York University Abu Dhabi; New York University Tandon School of Engineering
  • Abdul Basit, New York University Abu Dhabi
  • Siddharth Garg, New York University Tandon School of Engineering
  • Muhammad Shafique, New York University Abu Dhabi

DOI:

https://doi.org/10.1609/aaai.v40i44.41058

Abstract

Large language models (LLMs) remain persistently vulnerable to jailbreak attacks despite their increasing capabilities. While developers deploy alignment fine-tuning and safety guardrails, researchers consistently devise novel attacks that circumvent these defenses. This dynamic mirrors a strategic game of continual evolution. However, two challenges hinder jailbreak development: the high cost of querying top-tier LLMs and the short lifespan of effective attacks due to frequent safety updates. These factors limit the cost-efficiency and impact of jailbreak research. To address these challenges, we propose MetaCipher, a low-cost, multi-agent jailbreak framework that generalizes across LLMs with varying safety measures. MetaCipher uses reinforcement learning and is modular and adaptive, so it can be extended with future strategies. Within as few as 10 queries, MetaCipher achieves state-of-the-art attack success rates on recent malicious prompt benchmarks, outperforming prior jailbreak methods. We conduct a large-scale empirical evaluation across diverse victim models, demonstrating the framework's robustness and adaptability.

Published

2026-03-14

How to Cite

Chen, B., Shao, M., Basit, A., Garg, S., & Shafique, M. (2026). MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(44), 37268–37276. https://doi.org/10.1609/aaai.v40i44.41058

Issue

Vol. 40 No. 44 (2026)

Section

AAAI Special Track on AI Alignment