SOM Directions Are Better than One: Multi-Directional Refusal Suppression in Language Models

Authors

  • Giorgio Piras University of Cagliari, Italy
  • Raffaele Mura University of Cagliari, Italy
  • Fabio Brau University of Cagliari, Italy
  • Luca Oneto Università degli Studi di Genova, Italy
  • Fabio Roli Università degli Studi di Genova, Italy
  • Battista Biggio University of Cagliari, Italy

DOI

https://doi.org/10.1609/aaai.v40i39.40551

Abstract

Refusal refers to the functional behavior that enables safety-aligned language models to reject harmful or unethical prompts. Following the growing scientific interest in mechanistic interpretability, recent work has encoded refusal behavior as a single direction in the model's latent space, computed, for example, as the difference between the centroids of harmful and harmless prompt representations. However, emerging evidence suggests that concepts in LLMs often appear to be encoded as low-dimensional manifolds embedded in the high-dimensional latent space. Motivated by these findings, we propose a novel method that leverages Self-Organizing Maps (SOMs) to extract multiple refusal directions. To this end, we first prove that SOMs generalize the difference-in-means technique used in prior work. We then train SOMs on harmful prompt representations to identify multiple neurons. By subtracting the centroid of harmless representations from each neuron, we derive a set of directions expressing the refusal concept. We validate our method on an extensive experimental setup, demonstrating that ablating multiple directions from a model's internals outperforms not only the single-direction baseline but also specialized jailbreak algorithms, leading to effective suppression of refusal. We conclude by analyzing the mechanistic implications of our approach.
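The pipeline described in the abstract can be illustrated with a minimal NumPy sketch on synthetic data. It is not the authors' implementation: the function names, the 2x2 grid, the learning-rate and neighborhood schedules, and the choice to ablate by projecting out the span of the directions through a QR factorization are all assumptions made for illustration, and the random arrays merely stand in for hidden representations of harmful and harmless prompts.

```python
import numpy as np


def difference_in_means(harmful, harmless):
    """Single-direction baseline: harmful centroid minus harmless centroid."""
    return harmful.mean(axis=0) - harmless.mean(axis=0)


def train_som(data, grid=(2, 2), epochs=20, lr0=0.5, sigma0=1.0, seed=0):
    """Toy online SOM; returns one weight vector ("neuron") per grid cell.

    Hyperparameters and decay schedules are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    rows, cols = grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialize each neuron from a distinct random training sample.
    init_idx = rng.choice(len(data), size=rows * cols, replace=False)
    weights = data[init_idx].astype(float).copy()
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for idx in rng.permutation(len(data)):
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                   # linearly decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 1e-3      # shrinking neighborhood radius
            x = data[idx]
            # Best-matching unit: the neuron closest to the sample.
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            grid_dist2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            # Pull every neuron toward x, weighted by grid proximity to the BMU.
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights


def som_directions(weights, harmless):
    """One unit-norm direction per neuron: neuron minus harmless centroid."""
    dirs = weights - harmless.mean(axis=0)
    return dirs / np.linalg.norm(dirs, axis=1, keepdims=True)


def ablate(acts, directions):
    """Remove the span of `directions` from each row of `acts`.

    Assumes the directions are linearly independent (an assumption of this
    sketch), so the QR factor is an orthonormal basis of exactly their span.
    """
    q, _ = np.linalg.qr(directions.T)  # shape (d, k)
    return acts - (acts @ q) @ q.T


# Synthetic stand-ins for prompt representations (not real model activations).
rng = np.random.default_rng(0)
harmful = rng.normal(loc=1.0, size=(200, 16))
harmless = rng.normal(loc=0.0, size=(200, 16))

baseline = difference_in_means(harmful, harmless)
weights = train_som(harmful)
directions = som_directions(weights, harmless)
cleaned = ablate(harmful, directions)
```

With a 1x1 grid the map has a single neuron that can only approximate the harmful centroid, which gives an informal intuition for the claim that SOMs generalize difference-in-means; the formal statement and proof are in the paper. Projecting out the joint span, rather than subtracting each direction in turn, is a design choice of this sketch that keeps the result independent of the order of the directions.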

Published

2026-03-14

How to Cite

Piras, G., Mura, R., Brau, F., Oneto, L., Roli, F., & Biggio, B. (2026). SOM Directions Are Better than One: Multi-Directional Refusal Suppression in Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 32728–32736. https://doi.org/10.1609/aaai.v40i39.40551

Section

AAAI Technical Track on Natural Language Processing IV