ShieldRAG: Safeguarding Retrieval-Augmented Generation from Untrusted Knowledge Bases

Authors

  • Peiru Yang Tsinghua University
  • Haoran Zheng Beijing University of Posts and Telecommunications
  • Yi Luo Beijing University of Posts and Telecommunications
  • Xinyi Liu Tsinghua University
  • Jinrui Wang Beijing University of Posts and Telecommunications
  • Huili Wang Tsinghua University
  • Xintian Li Tsinghua University
  • Yongfeng Huang Tsinghua University
  • Tao Qi Beijing University of Posts and Telecommunications

DOI:

https://doi.org/10.1609/aaai.v40i40.40725

Abstract

Open knowledge bases (e.g., websites) are widely adopted in Retrieval-Augmented Generation (RAG) systems to provide supplementary knowledge (e.g., the latest information). However, such sources inevitably contain biased or harmful content, and incorporating this untrusted content into the RAG process introduces significant safety risks, including the degradation of LLM performance and the potential generation of harmful outputs. Recent studies have shown that this vulnerability can be further amplified by adversarial poisoning attacks specifically targeting the knowledge sources. Most existing methods primarily emphasize improving the accuracy and efficiency of RAG systems, usually overlooking these critical safety concerns. In this paper, we propose a safety-aware retrieval framework (ShieldRAG) designed to augment language model generation by jointly optimizing for both relevance and safety in the retrieved knowledge content. The core idea of ShieldRAG is to transfer the safety knowledge implicitly encoded in powerful LLMs into the retriever model through an adversarial knowledge alignment mechanism. This empowers the retriever with safety awareness and enables it to adapt to the diverse and unknown distributions of unsafe content encountered in practical scenarios. We evaluate ShieldRAG on seven real-world datasets using five widely used LLMs and two state-of-the-art poisoning attack strategies. Experimental results show that our method substantially improves the robustness of RAG systems against unsafe knowledge sources, while maintaining competitive performance in terms of generation accuracy and efficiency.
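The abstract's core idea of "jointly optimizing for both relevance and safety" can be illustrated with a minimal reranking sketch. This is not the paper's method (ShieldRAG learns safety awareness inside the retriever via adversarial knowledge alignment with an LLM); it is only a generic illustration in which a hypothetical per-passage `safety_score` is combined with query–passage relevance when ranking candidates.

```python
import math

def safety_aware_rerank(query_vec, candidates, alpha=0.5):
    """Rank candidate passages by a convex combination of relevance
    and safety -- an illustrative sketch of safety-aware retrieval,
    NOT the ShieldRAG training procedure.

    candidates: list of (doc_vec, safety_score) pairs, where
    safety_score in [0, 1] is assumed to come from some safety
    scorer (hypothetical here).
    Returns candidate indices, best first.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    scored = [
        (alpha * cosine(query_vec, vec) + (1 - alpha) * safety, i)
        for i, (vec, safety) in enumerate(candidates)
    ]
    return [i for _, i in sorted(scored, reverse=True)]

# A maximally relevant but unsafe passage is outranked by a
# slightly less relevant, safe one.
ranking = safety_aware_rerank(
    [1.0, 0.0],
    [([1.0, 0.0], 0.0),   # perfect relevance, flagged unsafe
     ([0.9, 0.1], 1.0)],  # near-perfect relevance, safe
)
```

Under this weighting the safe passage is ranked first; with `alpha=1.0` the scheme degrades to a purely relevance-based retriever.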

Published

2026-03-14

How to Cite

Yang, P., Zheng, H., Luo, Y., Liu, X., Wang, J., Wang, H., … Qi, T. (2026). ShieldRAG: Safeguarding Retrieval-Augmented Generation from Untrusted Knowledge Bases. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 34286–34294. https://doi.org/10.1609/aaai.v40i40.40725

Section

AAAI Technical Track on Natural Language Processing V