Optimizing Against Safety Representations: Activation-Guided Adversarial Suffixes and the Geometry of Refusal

Ege Cakar; Hannah Guan; Kayden Kehe

doi:10.1609/aaaiss.v9i1.42902

Authors

Ege Cakar Harvard University
Hannah Guan Harvard University
Kayden Kehe Harvard University


Abstract

Behavioral alignment in large language models often masks fragile internal safety representations. Recent work suggests that refusal behavior is mediated by low-dimensional directions in activation space. This raises questions about how such representations are structured, localized, and accessed by optimization. We study adversarial suffix attacks as a probe of representational alignment. We introduce Activation-Guided GCG, which replaces output-based objectives with losses that directly target a model's internal refusal direction. Across several objective variants, we find that suppressing refusal globally across all layers and positions is more effective than targeting a single layer–position pair. This suggests that safety representations are distributed across the forward pass rather than causally localized to a single site. We further introduce Soft-GCG, a continuous relaxation of discrete suffix optimization using Gumbel-Softmax. Soft-GCG achieves a 33x speedup over standard GCG while improving attack success rates. Evaluating across model scales, we find that smaller models remain vulnerable, while larger models resist both activation- and suffix-based attacks at our compute-constrained settings, consistent with larger and better safety-trained models being harder to jailbreak. Together, our results clarify how safety mechanisms are encoded and how they can be broken in contemporary models. These insights provide concrete guidance for designing more robust, representation-aware alignment strategies.
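
The two ideas named in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. The abstract does not give the form of the loss, so the squared projection onto a unit-norm refusal direction is an assumption, as are all function names and the tiny hand-written activations and embedding table. The sketch contrasts a single-site objective with a global one averaged over every layer and position, and shows how a Gumbel-Softmax sample turns a discrete token choice into a weighted mix of embeddings.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Draw a relaxed one-hot sample over tokens (Gumbel-Softmax)."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_embedding(weights, table):
    """Mix embedding rows by the relaxed token weights."""
    dim = len(table[0])
    return [sum(w * row[d] for w, row in zip(weights, table)) for d in range(dim)]


def projection(vec, direction):
    """Dot product with the (assumed unit-norm) refusal direction."""
    return sum(v * d for v, d in zip(vec, direction))


def single_site_loss(acts, direction, layer, pos):
    """Assumed objective: squared projection at one layer-position pair."""
    return projection(acts[layer][pos], direction) ** 2


def global_loss(acts, direction):
    """Assumed objective: mean squared projection over all layers and positions."""
    terms = [projection(vec, direction) ** 2 for layer in acts for vec in layer]
    return sum(terms) / len(terms)


# Hypothetical toy data: 3 candidate tokens with 2-d embeddings.
rng = random.Random(0)
weights = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5, rng=rng)
table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = soft_embedding(weights, table)

# Hypothetical activations indexed as acts[layer][position].
acts = [[[1.0, 0.0], [0.0, 2.0]], [[3.0, 0.0], [0.0, 0.0]]]
refusal_dir = [1.0, 0.0]
local = single_site_loss(acts, refusal_dir, layer=0, pos=0)  # 1.0
overall = global_loss(acts, refusal_dir)                     # 2.5
```

In this toy case the single-site objective sees only the projection at layer 0, position 0, while the global objective also registers the larger projection at layer 1. In an actual attack, the relaxed weights would be optimized by gradient descent through the model, which requires an automatic differentiation framework and is outside the scope of this sketch.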
