DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints

Authors

  • Andrew Zhao, Tsinghua University
  • Quentin Xu, Tsinghua University
  • Matthieu Lin, Tsinghua University
  • Shenzhi Wang, Tsinghua University
  • Yong-Jin Liu, Tsinghua University
  • Zilong Zheng, Beijing Institute for General Artificial Intelligence
  • Gao Huang, Tsinghua University

DOI:

https://doi.org/10.1609/aaai.v39i24.34797

Abstract

Recent advances in large language model assistants have made them indispensable, raising significant concerns over managing their safety. Automated red teaming offers a promising alternative to labor-intensive and error-prone manual probing for vulnerabilities, providing more consistent and scalable safety evaluations. However, existing approaches often compromise diversity by focusing on maximizing the attack success rate. Additionally, methods that reward semantic diversity by decreasing cosine similarity to historical embeddings suffer novelty stagnation as the history grows. To address these issues, we introduce DiveR-CT, which relaxes the conventional constraints on the objective and semantic reward, granting the policy greater freedom to enhance diversity. Our experiments demonstrate DiveR-CT's marked superiority over baselines: 1) it generates data that score better on various diversity metrics across different attack success rate levels; 2) it better enhances the resilience of blue-team models through safety tuning on the collected data; 3) it allows dynamic control of objective weights for reliable and controllable attack success rates; and 4) it reduces susceptibility to reward overoptimization. Overall, our method provides an effective and efficient approach to LLM red teaming, accelerating real-world deployment. WARNING: This paper contains examples of potentially harmful text.
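The novelty stagnation the abstract attributes to history-based diversity rewards can be illustrated with a toy sketch. This is hypothetical code, not the paper's implementation: it models a baseline semantic diversity reward of one minus the maximum cosine similarity to all historical embeddings, and shows that as the history grows, the maximum is taken over ever more points, so the reward shrinks even for fresh random samples.

```python
import numpy as np

def diversity_reward(new_emb, history):
    """Baseline-style reward: 1 - max cosine similarity to history.

    As len(history) grows, the max is taken over more embeddings,
    so the reward drifts downward (novelty stagnation) even when
    the new sample is unrelated to everything seen before.
    """
    if not history:
        return 1.0  # nothing to compare against yet
    h = np.stack(history)
    sims = h @ new_emb / (np.linalg.norm(h, axis=1) * np.linalg.norm(new_emb))
    return 1.0 - float(sims.max())

rng = np.random.default_rng(0)
history, rewards = [], []
for _ in range(500):
    emb = rng.normal(size=16)      # stand-in for a sentence embedding
    rewards.append(diversity_reward(emb, history))
    history.append(emb)

early = sum(rewards[1:50]) / 49    # average reward with a short history
late = sum(rewards[450:500]) / 50  # average reward with a long history
# late < early: the reward saturates purely because the history grew
```

Under this toy model, the average reward drops as the archive fills up regardless of sample quality, which is the stagnation DiveR-CT's relaxed semantic reward is designed to avoid.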

Published

2025-04-11

How to Cite

Zhao, A., Xu, Q., Lin, M., Wang, S., Liu, Y.-J., Zheng, Z., & Huang, G. (2025). DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints. Proceedings of the AAAI Conference on Artificial Intelligence, 39(24), 26021–26030. https://doi.org/10.1609/aaai.v39i24.34797

Section

AAAI Technical Track on Natural Language Processing III