SceneJailEval: A Scenario-Adaptive Multi-Dimensional Framework for Jailbreak Evaluation

Authors

  • Lai Jiang: School of Computer Science, Shanghai Jiao Tong University, Shanghai 200240, China; Shanghai Key Laboratory of Integrated Administration Technologies for Information Security, Shanghai 200240, China
  • Yuekang Li: University of New South Wales, Sydney 2052, Australia
  • Xiaohan Zhang: School of Computer Science, Shanghai Jiao Tong University, Shanghai 200240, China; Shanghai Key Laboratory of Integrated Administration Technologies for Information Security, Shanghai 200240, China
  • Youtao Ding: School of Computer Science, Shanghai Jiao Tong University, Shanghai 200240, China; Shanghai Key Laboratory of Integrated Administration Technologies for Information Security, Shanghai 200240, China
  • Li Pan: School of Computer Science, Shanghai Jiao Tong University, Shanghai 200240, China; Shanghai Key Laboratory of Integrated Administration Technologies for Information Security, Shanghai 200240, China; Zhangjiang Institute for Advanced Study, Shanghai 201203, China

DOI:

https://doi.org/10.1609/aaai.v40i42.40866

Abstract

Accurate jailbreak evaluation is critical for LLM red-team testing and jailbreak research. Mainstream methods rely on binary classification (string matching, toxic-text classifiers, and LLM-based methods) and output only "yes/no" labels without quantifying harm severity. Emerging multi-dimensional frameworks (e.g., Security Violation, Relative Truthfulness, and Informativeness) apply a unified evaluation standard across all scenarios, which leads to scenario-specific mismatches (e.g., "Relative Truthfulness" is irrelevant to "hate speech") and undermines evaluation accuracy. To address these limitations, we propose SceneJailEval, with three key contributions: (1) A scenario-adaptive multi-dimensional framework for jailbreak evaluation that overcomes the "one-size-fits-all" limitation of existing multi-dimensional methods and can be extended to customized or emerging scenarios. (2) A novel 14-scenario dataset featuring rich jailbreak variants and regional cases, addressing the lack of high-quality, comprehensive benchmarks for scenario-adaptive evaluation. (3) State-of-the-art performance, with an F1 score of 0.917 on our full-scenario dataset (+6% over SOTA) and 0.995 on JBB (+3% over SOTA), improving on the accuracy of existing evaluation methods in heterogeneous scenarios.
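
As a rough illustration of the scenario-adaptive idea and of the F1 metric reported above, the sketch below scores a response using only the dimensions that apply to its scenario, then computes F1 over binary jailbreak labels. It is not the authors' implementation: the scenario names, the dimension weights, the 0 to 1 score scale, and the helpers `harm_score` and `f1_score` are all assumptions made for illustration. Only the dimension names are taken from the abstract.

```python
# Hypothetical scenario -> {dimension: weight} table. The dimension names
# appear in the abstract; the scenarios and weights are illustrative assumptions.
SCENARIO_DIMENSIONS = {
    # Truthfulness is not meaningful for hate speech, so it is left out here.
    "hate_speech": {"security_violation": 1.0},
    "medical_advice": {
        "security_violation": 0.5,
        "relative_truthfulness": 0.3,
        "informativeness": 0.2,
    },
}


def harm_score(scenario, dimension_scores):
    """Weighted mean of the dimension scores relevant to this scenario.

    Scores for dimensions the scenario does not use are ignored.
    """
    weights = SCENARIO_DIMENSIONS[scenario]
    total_weight = sum(weights.values())
    weighted = sum(w * dimension_scores[d] for d, w in weights.items())
    return weighted / total_weight


def f1_score(predicted, actual):
    """F1 over boolean labels, where True means 'jailbreak succeeded'."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0
```

Under these assumptions, a hate-speech response is scored on security violation alone, so any truthfulness score supplied for it has no effect on the result.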

Published

2026-03-14

How to Cite

Jiang, L., Li, Y., Zhang, X., Ding, Y., & Pan, L. (2026). SceneJailEval: A Scenario-Adaptive Multi-Dimensional Framework for Jailbreak Evaluation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(42), 35553–35561. https://doi.org/10.1609/aaai.v40i42.40866

Issue

Vol. 40 No. 42 (2026)

Section

AAAI Technical Track on Philosophy and Ethics of AI