DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints

Authors

  • Andrew Zhao, Tsinghua University
  • Quentin Xu, Tsinghua University
  • Matthieu Lin, Tsinghua University
  • Shenzhi Wang, Tsinghua University
  • Yong-Jin Liu, Tsinghua University
  • Zilong Zheng, Beijing Institute for General Artificial Intelligence
  • Gao Huang, Tsinghua University

DOI:

https://doi.org/10.1609/aaai.v39i24.34797

Abstract

Recent advances in large language model assistants have made them indispensable, raising significant concerns over managing their safety. Automated red teaming offers a promising alternative to labor-intensive and error-prone manual probing for vulnerabilities, providing more consistent and scalable safety evaluations. However, existing approaches often compromise diversity by focusing on maximizing the attack success rate. Additionally, methods that reward semantic diversity by decreasing cosine similarity to historical embeddings suffer novelty stagnation as the history grows. To address these issues, we introduce DiveR-CT, which relaxes the conventional constraints on the objective and semantic reward, granting the policy greater freedom to enhance diversity. Our experiments demonstrate DiveR-CT's marked superiority over baselines: 1) it generates data that score better on various diversity metrics across different attack success rate levels; 2) it better enhances the resilience of blue-team models through safety tuning on the collected data; 3) it allows dynamic control of objective weights for reliable and controllable attack success rates; and 4) it reduces susceptibility to reward overoptimization. Overall, our method provides an effective and efficient approach to LLM red teaming, accelerating real-world deployment. WARNING: This paper contains examples of potentially harmful text.
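The novelty stagnation the abstract attributes to history-based diversity rewards can be illustrated with a toy sketch. This is hypothetical code, not the paper's implementation: it models a baseline semantic diversity reward of one minus the maximum cosine similarity to all historical embeddings, and shows that as the history grows, the maximum is taken over ever more points, so the reward shrinks even for fresh random samples.

```python
import numpy as np

def diversity_reward(new_emb, history):
    """Baseline-style reward: 1 - max cosine similarity to history.

    As len(history) grows, the max is taken over more embeddings,
    so the reward drifts downward (novelty stagnation) even when
    the new sample is unrelated to everything seen before.
    """
    if not history:
        return 1.0  # nothing to compare against yet
    h = np.stack(history)
    sims = h @ new_emb / (np.linalg.norm(h, axis=1) * np.linalg.norm(new_emb))
    return 1.0 - float(sims.max())

rng = np.random.default_rng(0)
history, rewards = [], []
for _ in range(500):
    emb = rng.normal(size=16)      # stand-in for a sentence embedding
    rewards.append(diversity_reward(emb, history))
    history.append(emb)

early = sum(rewards[1:50]) / 49    # average reward with a short history
late = sum(rewards[450:500]) / 50  # average reward with a long history
# late < early: the reward saturates purely because the history grew
```

Under this toy model, the average reward drops as the archive fills up regardless of sample quality, which is the stagnation DiveR-CT's relaxed semantic reward is designed to avoid.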

Published

2025-04-11

How to Cite

Zhao, A., Xu, Q., Lin, M., Wang, S., Liu, Y.-J., Zheng, Z., & Huang, G. (2025). DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints. Proceedings of the AAAI Conference on Artificial Intelligence, 39(24), 26021–26030. https://doi.org/10.1609/aaai.v39i24.34797

Section

AAAI Technical Track on Natural Language Processing III