SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models

Muxi Diao; Rumei Li; Shiyang Liu; Guogang Liao; Jingang Wang; Xunliang Cai; Weiran Xu

doi:10.1609/aaai.v39i22.34549

Authors

Muxi Diao, Beijing University of Posts and Telecommunications
Rumei Li, Meituan
Shiyang Liu, Meituan
Guogang Liao, Meituan
Jingang Wang, Meituan
Xunliang Cai, Meituan
Weiran Xu, Beijing University of Posts and Telecommunications

DOI:

https://doi.org/10.1609/aaai.v39i22.34549

Abstract

As Large Language Models (LLMs) continue to advance in capability and influence, ensuring their security and preventing harmful outputs has become crucial. A promising approach to address these concerns involves training models to automatically generate adversarial prompts for red teaming. However, the evolving subtlety of vulnerabilities in LLMs challenges the effectiveness of current adversarial methods, which struggle to generate diverse, complex prompts and dynamically explore the weaknesses of these models. To tackle these challenges, we introduce the Self-Evolving Adversarial Safety (SEAS) optimization framework, which includes both a SEAS dataset and a SEAS pipeline. The SEAS dataset comprises complex adversarial prompts, while the SEAS pipeline operates through three stages: Initialization, Attack, and Adversarial Optimization. This framework generates a diverse range of adversarial prompts and dynamically explores the model's vulnerabilities to enhance its security. Our contributions include a novel adversarial framework, a comprehensive safety dataset, and empirical evidence demonstrating the effectiveness of SEAS.
