LLM Stinger: Jailbreaking LLMs Using RL Fine-Tuned LLMs (Student Abstract)

Authors

  • Piyush Jha, Georgia Institute of Technology
  • Arnav Arora, Georgia Institute of Technology
  • Vijay Ganesh, Georgia Institute of Technology

DOI

https://doi.org/10.1609/aaai.v39i28.35263

Abstract

We introduce LLM Stinger, a novel approach that leverages Large Language Models (LLMs) to automatically generate adversarial suffixes for jailbreak attacks. Unlike traditional methods, which require complex prompt engineering or white-box access, LLM Stinger uses a reinforcement learning (RL) loop to fine-tune an attacker LLM that generates new suffixes based on existing attacks for harmful questions from the HarmBench benchmark. Our method significantly outperforms existing red-teaming approaches (we compared against 15 of the latest methods), achieving a +57.2% improvement in Attack Success Rate (ASR) on LLaMA2-7B-chat and a +50.3% ASR increase on Claude 2, both models known for their extensive safety measures. Additionally, we achieved a 94.97% ASR on GPT-3.5 and a 99.4% ASR on Gemma-2B-it, demonstrating the robustness and adaptability of LLM Stinger across open- and closed-source models.

Published

2025-04-11

How to Cite

Jha, P., Arora, A., & Ganesh, V. (2025). LLM Stinger: Jailbreaking LLMs Using RL Fine-Tuned LLMs (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 39(28), 29393–29395. https://doi.org/10.1609/aaai.v39i28.35263