Optimizing Against Safety Representations: Activation-Guided Adversarial Suffixes and the Geometry of Refusal
DOI:
https://doi.org/10.1609/aaaiss.v9i1.42902
Abstract
Behavioral alignment in large language models often masks fragile internal safety representations. Recent work suggests that refusal behavior is mediated by low-dimensional directions in activation space. This raises questions about how such representations are structured, where they are localized, and how optimization can access them. We study adversarial suffix attacks as a probe of representational alignment. We introduce Activation-Guided GCG, which replaces output-based objectives with losses that directly target a model's internal refusal direction. Across several objective variants, we find that suppressing refusal globally across all layers and positions is more effective than targeting a single layer–position pair. This suggests that safety representations are distributed across the forward pass rather than causally localized to a single site. We further introduce Soft-GCG, a continuous relaxation of discrete suffix optimization using Gumbel-Softmax. Soft-GCG achieves a 33x speedup over standard GCG while improving attack success rates. Evaluating across model scales, we find that smaller models remain vulnerable, while larger models resist both activation- and suffix-based attacks at our compute-constrained settings. This is consistent with larger, better safety-trained models being harder to jailbreak. Together, our results clarify how safety mechanisms are encoded and can be broken in contemporary models. These insights provide concrete guidance for designing more robust and representation-aware alignment strategies.
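The two ingredients named in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the function names, the squared-projection form of the loss, and the nested-list layout `acts[layer][position]` are assumptions made here for illustration only. The first function draws a standard Gumbel-Softmax sample, the kind of continuous relaxation of a discrete token choice that Soft-GCG is described as using. The remaining functions contrast a loss measured at a single layer–position pair with one averaged over every layer and position.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Draw one relaxed one-hot sample over a vocabulary (Gumbel-Softmax)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise by inverse transform; clamp u away from 0 to avoid log(0).
        u = max(rng.random(), 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    top = max(noisy)
    exps = [math.exp(x - top) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def projection(activation, direction):
    """Scalar projection of an activation vector onto a direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(a * d for a, d in zip(activation, direction)) / norm


def single_site_loss(acts, direction, layer, position):
    """Assumed loss form: squared projection at one layer-position pair."""
    return projection(acts[layer][position], direction) ** 2


def global_loss(acts, direction):
    """Assumed loss form: mean squared projection over all layers and positions."""
    values = [
        projection(vector, direction) ** 2
        for layer in acts
        for vector in layer
    ]
    return sum(values) / len(values)


if __name__ == "__main__":
    rng = random.Random(0)
    # Relaxed distribution over a hypothetical three-token vocabulary.
    print(gumbel_softmax([2.0, 0.5, -1.0], tau=0.5, rng=rng))

    # Hypothetical activations: 2 layers x 2 positions x 2 hidden dimensions.
    acts = [[[1.0, 0.0], [0.0, 2.0]], [[3.0, 4.0], [0.0, 0.0]]]
    direction = [1.0, 0.0]
    print(single_site_loss(acts, direction, layer=0, position=0))
    print(global_loss(acts, direction))
```

Lowering `tau` pushes each sample toward a one-hot vector, while raising it gives smoother distributions. The global loss aggregates every site, which mirrors the abstract's contrast between distributed and single-site objectives.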
Published
2026-06-23
How to Cite
Cakar, E., Guan, H., & Kehe, K. (2026). Optimizing Against Safety Representations: Activation-Guided Adversarial Suffixes and the Geometry of Refusal. Proceedings of the AAAI Symposium Series, 9(1), 27–34. https://doi.org/10.1609/aaaiss.v9i1.42902
Section
AI-Driven Resilience: Building Robust, Adaptive Technologies for a Dynamic World (Full Papers)