SubGCache: Accelerating Graph-based RAG with Subgraph-level KV Cache

Authors

  • Qiuyu Zhu, Nanyang Technological University
  • Liang Zhang, Hong Kong University of Science and Technology (Guangzhou)
  • Qianxiong Xu, Nanyang Technological University
  • Cheng Long, Nanyang Technological University
  • Jie Zhang, Nanyang Technological University

DOI:

https://doi.org/10.1609/aaai.v40i41.40827

Abstract

Graph-based retrieval-augmented generation (RAG) enables large language models (LLMs) to incorporate structured knowledge, retrieved from a graph, as contextual input, supporting more accurate and context-aware reasoning. We observe that different queries can retrieve similar subgraphs as prompts, and we therefore propose SubGCache, which reduces inference latency by reusing computation across queries with similar structural prompts (i.e., subgraphs). Specifically, SubGCache clusters queries based on their subgraph embeddings, constructs a representative subgraph for each cluster, and pre-computes the key-value (KV) cache of that representative subgraph. For each query whose retrieved subgraph falls within a cluster, SubGCache reuses the pre-computed KV cache of the cluster's representative subgraph, avoiding recomputation of the KV tensors and thereby saving computation. Extensive experiments on three datasets across multiple LLM backbones and graph-based RAG frameworks demonstrate that SubGCache consistently reduces inference latency with comparable and even improved generation quality, achieving up to a 6.68x reduction in time-to-first-token (TTFT).
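The reuse path described in the abstract can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: all names (`SubGCache`, `precompute`, `lookup`) and the string stand-in for the LLM prefill are assumptions, and cluster assignment is simplified to nearest-centroid in the subgraph-embedding space.

```python
from dataclasses import dataclass, field

def dist(a, b):
    # Euclidean distance between two subgraph embeddings.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

@dataclass
class SubGCache:
    # Cluster centroids in subgraph-embedding space (assumed given by a
    # prior clustering step over retrieved-subgraph embeddings).
    centroids: list
    # Cluster id -> pre-computed KV cache of the representative subgraph.
    kv_store: dict = field(default_factory=dict)

    def precompute(self, cluster_id, representative_tokens):
        # Stand-in for running the LLM prefill ONCE over the cluster's
        # representative subgraph; a real system would store KV tensors.
        self.kv_store[cluster_id] = [f"kv({t})" for t in representative_tokens]

    def lookup(self, subgraph_embedding):
        # Assign the query's retrieved subgraph to the nearest cluster and
        # reuse that cluster's pre-computed KV cache instead of re-prefilling.
        cid = min(range(len(self.centroids)),
                  key=lambda i: dist(subgraph_embedding, self.centroids[i]))
        return cid, self.kv_store.get(cid)
```

Usage: after clustering, each cluster's representative subgraph is prefilled once via `precompute`; at query time, `lookup` returns the reusable cache, so only the query-specific suffix of the prompt needs fresh KV computation.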

Published

2026-03-14

How to Cite

Zhu, Q., Zhang, L., Xu, Q., Long, C., & Zhang, J. (2026). SubGCache: Accelerating Graph-based RAG with Subgraph-level KV Cache. Proceedings of the AAAI Conference on Artificial Intelligence, 40(41), 35204–35212. https://doi.org/10.1609/aaai.v40i41.40827

Section

AAAI Technical Track on Natural Language Processing VI