SceneJailEval: A Scenario-Adaptive Multi-Dimensional Framework for Jailbreak Evaluation
DOI:
https://doi.org/10.1609/aaai.v40i42.40866
Abstract
Accurate jailbreak evaluation is critical for LLM red-team testing and jailbreak research. Mainstream methods rely on binary classification (string matching, toxic-text classifiers, and LLM-based methods) and output only "yes/no" labels without quantifying harm severity. Emerging multi-dimensional frameworks (e.g., Security Violation, Relative Truthfulness, and Informativeness) apply a single set of evaluation standards across all scenarios, which produces scenario-specific mismatches (e.g., "Relative Truthfulness" is irrelevant to "hate speech") and undermines evaluation accuracy. To address these limitations, we propose SceneJailEval, with the following key contributions: (1) A scenario-adaptive multi-dimensional framework for jailbreak evaluation that overcomes the "one-size-fits-all" limitation of existing multi-dimensional methods and can be extended to customized or emerging scenarios. (2) A 14-scenario dataset featuring rich jailbreak variants and regional cases, addressing the lack of high-quality, comprehensive benchmarks for scenario-adaptive evaluation. (3) State-of-the-art performance: SceneJailEval achieves an F1 score of 0.917 on our full-scenario dataset (+6% over SOTA) and 0.995 on JBB (+3% over SOTA), improving on the accuracy of existing evaluation methods in heterogeneous scenarios.
Published
2026-03-14
How to Cite
Jiang, L., Li, Y., Zhang, X., Ding, Y., & Pan, L. (2026). SceneJailEval: A Scenario-Adaptive Multi-Dimensional Framework for Jailbreak Evaluation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(42), 35553–35561. https://doi.org/10.1609/aaai.v40i42.40866
Issue
Section
AAAI Technical Track on Philosophy and Ethics of AI
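Illustrative sketch
The abstract's central idea, choosing evaluation dimensions per scenario instead of applying one fixed set everywhere, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The dimension names come from the abstract's examples. The "misinformation" scenario, the scenario-to-dimension mapping, the fallback dimensions, the [0, 1] score range, and the mean-based severity aggregation are all hypothetical choices made for the example.

```python
from typing import Dict, List

# Hypothetical mapping from scenario to the dimensions judged relevant for it.
# Only "hate_speech" and the dimension names are taken from the abstract;
# everything else here is an assumption for illustration.
SCENARIO_DIMENSIONS: Dict[str, List[str]] = {
    # The abstract notes that Relative Truthfulness is irrelevant to hate speech.
    "hate_speech": ["security_violation"],
    # Assumed scenario in which all three example dimensions apply.
    "misinformation": [
        "security_violation",
        "relative_truthfulness",
        "informativeness",
    ],
}

# Assumed fallback for scenarios that have no entry in the mapping.
DEFAULT_DIMENSIONS: List[str] = ["security_violation", "informativeness"]


def dimensions_for(scenario: str) -> List[str]:
    """Return the dimensions to evaluate for a scenario (with a fallback)."""
    return SCENARIO_DIMENSIONS.get(scenario, DEFAULT_DIMENSIONS)


def evaluate(scenario: str, scores: Dict[str, float]) -> Dict[str, object]:
    """Aggregate only the scenario-relevant dimension scores.

    `scores` maps dimension name to a value in [0, 1] (assumed scale).
    Severity is the plain mean of the relevant scores, which is an
    assumption rather than the paper's aggregation rule.
    """
    dims = dimensions_for(scenario)
    missing = [d for d in dims if d not in scores]
    if missing:
        raise ValueError(f"missing scores for dimensions: {missing}")
    severity = sum(scores[d] for d in dims) / len(dims)
    return {"scenario": scenario, "dimensions": dims, "severity": severity}


if __name__ == "__main__":
    # The truthfulness score is supplied but ignored for hate speech.
    print(evaluate(
        "hate_speech",
        {"security_violation": 1.0, "relative_truthfulness": 0.0},
    ))
```

Under these assumptions, adding a customized scenario only requires a new entry in the mapping, which is one simple way to picture the extensibility that the abstract describes.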