SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models
DOI:
https://doi.org/10.1609/aaai.v39i22.34549
Abstract
As Large Language Models (LLMs) continue to advance in capability and influence, ensuring their security and preventing harmful outputs has become crucial. A promising approach to address these concerns involves training models to automatically generate adversarial prompts for red teaming. However, the evolving subtlety of vulnerabilities in LLMs challenges the effectiveness of current adversarial methods, which struggle to generate diverse, complex prompts and dynamically explore the weaknesses of these models. To tackle these challenges, we introduce the Self-Evolving Adversarial Safety (SEAS) optimization framework, which includes both a SEAS dataset and a SEAS pipeline. The SEAS dataset comprises complex adversarial prompts, while the SEAS pipeline operates through three stages: Initialization, Attack, and Adversarial Optimization. This framework generates a diverse range of adversarial prompts and dynamically explores the model's vulnerabilities to enhance its security. Our contributions include a novel adversarial framework, a comprehensive safety dataset, and empirical evidence demonstrating the effectiveness of SEAS.
Published
2025-04-11
How to Cite
Diao, M., Li, R., Liu, S., Liao, G., Wang, J., Cai, X., & Xu, W. (2025). SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 39(22), 23778–23786. https://doi.org/10.1609/aaai.v39i22.34549
Issue
Section
AAAI Technical Track on Natural Language Processing I