Always Refuse: Steering LLMs Against Jailbreaks with Contrastive Activations (Student Abstract)

Abhilekh Borah; Niranjan Chebrolu; Kokil Jaidka

doi:10.1609/aaai.v40i48.42191

Authors

Abhilekh Borah, National University of Singapore
Niranjan Chebrolu, National University of Singapore
Kokil Jaidka, National University of Singapore

DOI:

https://doi.org/10.1609/aaai.v40i48.42191

Abstract

“Refusals must be resilient, not brittle.” Yet guarding refusals against adversarial phrasing and shifting user contexts remains difficult: large language models (LLMs) still yield to jailbreak prompts that evade safety filters and surface harmful content. We propose Refusal Activation Steering (RAS), a training-free, inference-time method that uses contrastive activations to shift LLM responses, biasing generation trajectories toward refusals without altering model weights. The approach is modular and domain-targetable, avoiding collateral refusals on benign queries while strengthening activation-space boundaries for unsafe content. On adversarial evaluations with an 8B instruction-tuned model, we find that steering improves refusal rate by ∼52% and reduces attack success rate by ∼40%, establishing a lightweight and interpretable safety layer for robust refusal consistency. To foster further research in this domain, we have made our implementation publicly available.
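The abstract does not say how RAS builds or applies its steering signal, so the following is only a generic illustration of contrastive activation steering, not the authors' implementation. It assumes a mean-difference direction between activations collected on refusal-eliciting and compliance-eliciting prompts, added to hidden states at inference time with a hypothetical strength `alpha`. The activations are synthetic NumPy arrays standing in for a real model's hidden states, and every name here (`steering_vector`, `steer`, `alpha`) is hypothetical.

```python
import numpy as np


def steering_vector(refusal_acts, compliance_acts):
    """Unit-norm mean-difference direction between two activation sets.

    Both inputs have shape (num_prompts, hidden_dim). This is an assumed,
    generic contrastive construction, not the paper's documented recipe.
    """
    diff = refusal_acts.mean(axis=0) - compliance_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def steer(hidden, direction, alpha=4.0):
    """Shift hidden states along `direction`; model weights are untouched."""
    return hidden + alpha * direction


# Synthetic stand-in data: the two prompt sets differ along axis 0.
rng = np.random.default_rng(0)
hidden_dim = 16
axis = np.zeros(hidden_dim)
axis[0] = 1.0
refusal_acts = rng.normal(size=(32, hidden_dim)) + 2.0 * axis
compliance_acts = rng.normal(size=(32, hidden_dim)) - 2.0 * axis

v = steering_vector(refusal_acts, compliance_acts)

# Apply the shift to a batch of hypothetical hidden states at inference time.
h = rng.normal(size=(5, hidden_dim))
h_steered = steer(h, v, alpha=4.0)
```

In a real model the same addition would typically be applied to the residual stream of a chosen layer through a forward hook. The layer choice and the value of `alpha` are tuning decisions that the abstract does not specify.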

Published

2026-03-14

How to Cite

Borah, A., Chebrolu, N., & Jaidka, K. (2026). Always Refuse: Steering LLMs Against Jailbreaks with Contrastive Activations (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 40(48), 41140–41142. https://doi.org/10.1609/aaai.v40i48.42191

Issue

Vol. 40 No. 48: EAAI-26 AI for Education, Model AI Assignments, AAAI-26 Emerging Trends, Doctoral Consortium, Student Abstracts, Undergraduate Consortium and Demonstrations

Section

AAAI Student Abstract and Poster Program
