EigenShield: Inference-Time, Model-Agnostic Jailbreaking Defense via Causal Subspace Filtering

Authors

  • Nastaran Darabi, University of Illinois at Chicago
  • Devashri Naik, University of Illinois at Chicago
  • Sina Tayebati, University of Illinois at Chicago
  • Dinithi Jayasuriya, University of Illinois at Chicago
  • Ranganath Krishnan, Capital One, AI Labs
  • Amit Ranjan Trivedi, University of Illinois at Chicago

DOI:

https://doi.org/10.1609/aaai.v40i5.37350

Abstract

Large Language Models (LLMs) and Vision-Language Models (VLMs) remain highly vulnerable to adversarial attacks despite widespread adoption. Existing defenses typically require retraining, rely on heuristics, or fail under adaptive and out-of-distribution (OOD) conditions. We introduce EigenShield, a principled, inference-time, architecture-agnostic defense that leverages Random Matrix Theory (RMT) to suppress adversarial noise in high-dimensional embeddings. EigenShield uses spiked covariance modeling and a Robustness-based Nonconformity Score (RbNS) with quantile thresholding to isolate and preserve causal eigenvectors, filtering out adversarial components without model access or adversarial training. We develop a theoretical framework establishing conditions for asymptotic noise suppression and demonstrate effectiveness in both unimodal and multimodal settings. Empirically, EigenShield consistently improves robustness across threat models, reducing attack success rates (ASR) by up to 48% over state-of-the-art defenses, including adversarial training, UNIGUARD, CIDER, and input transformations. On jailbreak attacks, EigenShield lowers LLM ASR by up to 92.9% relative to undefended models. Under multimodal adversarial attacks, it reduces VLM ASR by up to 76.5%. Against adaptive attacks on LLMs, it achieves ASR reductions of up to 77.7%. In OOD settings, EigenShield maintains strong performance, reducing ASR by up to 88.4% for LLMs and 80.4% for VLMs.
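The core idea described above — modeling the embedding covariance as a low-rank "spiked" signal plus high-dimensional noise, and keeping only eigenvectors whose eigenvalues escape the noise bulk — can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses the Marchenko–Pastur bulk edge from Random Matrix Theory as the retention criterion, and substitutes a plain eigenvalue test for the paper's Robustness-based Nonconformity Score (RbNS), whose details are not given in the abstract. The function names and the median-eigenvalue noise estimate are illustrative assumptions.

```python
import numpy as np


def mp_bulk_edge(n_samples: int, dim: int, sigma2: float = 1.0) -> float:
    """Marchenko-Pastur upper bulk edge: eigenvalues of a pure-noise
    sample covariance concentrate below this value asymptotically."""
    gamma = dim / n_samples
    return sigma2 * (1.0 + np.sqrt(gamma)) ** 2


def filter_embeddings(X: np.ndarray):
    """Project embeddings onto the 'spiked' (signal) eigenvectors,
    discarding directions whose eigenvalues fall inside the noise bulk.

    X: (n_samples, dim) matrix of embedding vectors.
    Returns the filtered embeddings and the number of retained directions.
    """
    n, d = X.shape
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / n

    # Eigendecomposition of the sample covariance (ascending eigenvalues).
    evals, evecs = np.linalg.eigh(cov)

    # Illustrative noise-variance estimate from the median eigenvalue;
    # the paper's RbNS-with-quantile-thresholding step would go here.
    sigma2 = float(np.median(evals))
    edge = mp_bulk_edge(n, d, sigma2)

    keep = evals > edge              # spiked eigenvalues escape the bulk
    V = evecs[:, keep]               # retained (signal) eigenvectors

    # Project onto the retained subspace, then restore the mean.
    X_filtered = Xc @ V @ V.T + mean
    return X_filtered, int(keep.sum())
```

A quick sanity check: embeddings generated as isotropic noise plus one strong planted direction should yield exactly that one direction above the bulk edge, and the filtered output keeps the signal while suppressing the remaining noise dimensions.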

Published

2026-03-14

How to Cite

Darabi, N., Naik, D., Tayebati, S., Jayasuriya, D., Krishnan, R., & Trivedi, A. R. (2026). EigenShield: Inference-Time, Model-Agnostic Jailbreaking Defense via Causal Subspace Filtering. Proceedings of the AAAI Conference on Artificial Intelligence, 40(5), 3524-3532. https://doi.org/10.1609/aaai.v40i5.37350

Section

AAAI Technical Track on Computer Vision II