EigenShield: Inference-Time, Model-Agnostic Jailbreaking Defense via Causal Subspace Filtering
DOI:
https://doi.org/10.1609/aaai.v40i5.37350
Abstract
Large Language Models (LLMs) and Vision-Language Models (VLMs) remain highly vulnerable to adversarial attacks despite widespread adoption. Existing defenses typically require retraining, rely on heuristics, or fail under adaptive and out-of-distribution (OOD) conditions. We introduce EigenShield, a principled, inference-time, architecture-agnostic defense that leverages Random Matrix Theory (RMT) to suppress adversarial noise in high-dimensional embeddings. EigenShield uses spiked covariance modeling and a Robustness-based Nonconformity Score (RbNS) with quantile thresholding to isolate and preserve causal eigenvectors, filtering out adversarial components without model access or adversarial training. We develop a theoretical framework establishing conditions for asymptotic noise suppression and demonstrate effectiveness in both unimodal and multimodal settings. Empirically, EigenShield consistently improves robustness across threat models, reducing attack success rates (ASR) by up to 48% over state-of-the-art defenses, including adversarial training, UNIGUARD, CIDER, and input transformations. On jailbreak attacks, EigenShield lowers LLM ASR by up to 92.9% relative to undefended models. Under multimodal adversarial attacks, it reduces VLM ASR by up to 76.5%. Against adaptive attacks on LLMs, it achieves ASR reductions of up to 77.7%. In OOD settings, EigenShield maintains strong performance, reducing ASR by up to 88.4% for LLMs and 80.4% for VLMs.
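The core idea of filtering adversarial noise via spiked covariance modeling can be illustrated with a minimal sketch. The snippet below separates "spiked" (signal-bearing) eigenvectors of an embedding covariance matrix from the Marchenko-Pastur noise bulk and projects embeddings onto the retained subspace. This is an assumption-laden illustration, not the paper's method: the function name, the median-based noise-variance estimate, and the use of the Marchenko-Pastur bulk edge as a stand-in for the paper's RbNS quantile thresholding are all placeholders for exposition.

```python
import numpy as np

def eigen_filter(embeddings):
    """Illustrative RMT-style subspace filter (not the paper's exact RbNS).

    Keeps only eigen-directions whose eigenvalues exceed the Marchenko-Pastur
    upper bulk edge, treating the rest as noise to be suppressed.
    """
    n, d = embeddings.shape
    mean = embeddings.mean(axis=0)
    X = embeddings - mean
    cov = X.T @ X / n                      # sample covariance of embeddings
    evals, evecs = np.linalg.eigh(cov)     # ascending eigenvalues
    sigma2 = np.median(evals)              # crude noise-variance estimate (assumption)
    gamma = d / n                          # aspect ratio of the data matrix
    mp_edge = sigma2 * (1.0 + np.sqrt(gamma)) ** 2  # Marchenko-Pastur upper edge
    keep = evals > mp_edge                 # spiked (signal-bearing) directions
    V = evecs[:, keep]
    P = V @ V.T                            # projector onto retained subspace
    return X @ P + mean, int(keep.sum())
```

On synthetic data with one planted high-variance direction plus isotropic noise, the filter retains the planted spike while discarding most bulk directions; in a real defense the thresholding rule and score would follow the paper's RbNS construction rather than this bulk-edge heuristic.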
Published
2026-03-14
How to Cite
Darabi, N., Naik, D., Tayebati, S., Jayasuriya, D., Krishnan, R., & Trivedi, A. R. (2026). EigenShield: Inference-Time, Model-Agnostic Jailbreaking Defense via Causal Subspace Filtering. Proceedings of the AAAI Conference on Artificial Intelligence, 40(5), 3524-3532. https://doi.org/10.1609/aaai.v40i5.37350
Section
AAAI Technical Track on Computer Vision II