Beyond I’m Sorry, I Can’t: Dissecting Large-Language-Model Refusal

Nirmalendu Prakash; Yeo Wei Jie; Amir Abdullah; Ranjan Satapathy; Erik Cambria; Roy Ka-Wei Lee

doi:10.1609/aaai.v40i44.41119

Authors

Nirmalendu Prakash, Singapore University of Technology and Design
Yeo Wei Jie, School of Computer Science and Engineering, Nanyang Technological University
Amir Abdullah, Thoughtworks
Ranjan Satapathy, Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR)
Erik Cambria, Nanyang Technological University
Roy Ka-Wei Lee, Singapore University of Technology and Design

Abstract

Refusal on harmful prompts is a key safety behaviour in instruction-tuned large language models (LLMs), yet the internal causes of this behaviour remain poorly understood. We study two public instruction-tuned models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT, using sparse autoencoders (SAEs) trained on residual-stream activations. Given a harmful prompt, we search the SAE latent space for feature sets whose ablation flips the model from refusal to compliance, demonstrating causal influence and creating a jailbreak. Our search proceeds in three stages: (1) Refusal Direction: finding a refusal-mediating direction and collecting the SAE features close to that direction; (2) Greedy Filtering: pruning this set to obtain a minimal set; and (3) Interaction Discovery: fitting a factorization-machine (FM) model that captures non-linear interactions among the remaining active features and the minimal set. This pipeline yields a broad set of jailbreak-critical features, offering insight into the mechanistic basis of refusal. We also find evidence of redundant features that remain dormant unless earlier features are suppressed. Our findings highlight the potential for fine-grained auditing and targeted intervention in safety behaviours by manipulating the interpretable latent space.
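The three stages named in the abstract can be illustrated with a small sketch on synthetic data. It is not the authors' implementation, and the abstract does not specify the estimators, so several choices here are assumptions: the refusal direction is taken as a difference of mean activations, "close to that direction" is read as cosine similarity against SAE decoder rows, greedy filtering is a single drop-one pass, and the model's refuse/comply check is replaced by a hypothetical toy oracle `toy_oracle`. Only the FM formula is the standard second-order factorization machine. All function and variable names are illustrative.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector from mean harmless to mean harmful activation (assumed estimator)."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def nearest_features(decoder, direction, k):
    """Indices of the k decoder rows with the highest cosine similarity to direction."""
    unit_rows = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    sims = unit_rows @ direction
    return [int(i) for i in np.argsort(-sims)[:k]]

def greedy_prune(candidates, still_jailbreaks):
    """Drop a feature whenever ablating the remaining ones still flips the model."""
    kept = list(candidates)
    for f in list(candidates):
        trial = [g for g in kept if g != f]
        if trial and still_jailbreaks(trial):
            kept = trial
    return kept

def fm_predict(x, w0, w, V):
    """Second-order FM: w0 + w.x + sum over i<j of <V[i], V[j]> * x[i] * x[j]."""
    xv = x @ V  # shape (k,): per-factor weighted sums
    pairwise = 0.5 * np.sum(xv ** 2 - (x ** 2) @ (V ** 2))
    return float(w0 + w @ x + pairwise)

# --- Toy data: stands in for residual-stream activations and an SAE decoder ---
rng = np.random.default_rng(0)
d_model, n_features = 16, 8
axis = np.zeros(d_model)
axis[0] = 1.0  # planted "refusal" axis for the synthetic example

harmless = rng.normal(size=(50, d_model))
harmful = harmless + 3.0 * axis  # harmful prompts shifted along the planted axis

decoder = 0.1 * rng.normal(size=(n_features, d_model))
for idx, scale in [(2, 2.0), (5, 1.5), (7, 1.0)]:
    decoder[idx] += scale * axis  # three features aligned with the planted axis

# Stage 1: direction plus nearby features.
direction = refusal_direction(harmful, harmless)
candidates = nearest_features(decoder, direction, k=3)

# Stage 2: prune with a hypothetical oracle. In a real setting this would
# ablate the features in the model and check whether it still refuses.
def toy_oracle(ablated):
    return {2, 5}.issubset(ablated)

minimal = greedy_prune(candidates, toy_oracle)

# Stage 3: score a feature-activation vector with (randomly initialised) FM weights.
x = rng.random(n_features)
score = fm_predict(x, w0=0.0, w=rng.normal(size=n_features),
                   V=rng.normal(size=(n_features, 4)))
```

In this toy setup the drop-one pass discards the candidate the oracle does not need and keeps the two it does. The pairwise term in `fm_predict` uses the usual linear-time rewrite of the sum over feature pairs, which is what lets an FM model interactions among many SAE features without enumerating every pair.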
