Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching

Authors

  • Yanhao Dong, Alibaba Cloud
  • Yubo Miao, Alibaba Cloud
  • Weinan Li, Alibaba Cloud
  • Xiao Zheng, Alibaba Cloud
  • Chao Wang, Alibaba Cloud
  • Jiesheng Wu, Alibaba Cloud
  • Feng Lyu, Central South University

DOI:

https://doi.org/10.1609/aaai.v40i25.39224

Abstract

Large Language Models (LLMs) exhibit pronounced memory-bound characteristics during inference due to High Bandwidth Memory (HBM) bandwidth constraints. In this paper, we propose an L2 Cache-oriented asynchronous KV Cache prefetching method that breaks through the memory bandwidth bottleneck in LLM inference via computation-load overlap. By strategically scheduling idle memory bandwidth during active computation windows, our method proactively prefetches the required KV Cache into the GPU L2 cache, enabling high-speed L2 cache hits for subsequent accesses and effectively hiding HBM access latency within computational cycles. Extensive experiments on NVIDIA H20 GPUs demonstrate that the proposed method achieves a 2.15× improvement in attention kernel efficiency and up to 1.97× end-to-end throughput enhancement, surpassing the state-of-the-art FlashAttention-3 baseline. Notably, our solution is orthogonal to existing optimization techniques and can be integrated with current inference frameworks, providing a scalable latency-hiding solution for next-generation LLM inference engines.
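
To make the compute-prefetch overlap concrete, the following is a minimal CUDA sketch, not the authors' implementation: the kernel name attention_with_prefetch, the contiguous per-block KV Cache layout, and the one-block lookahead policy are all illustrative assumptions. It uses the PTX prefetch.global.L2 instruction, which hints the memory system to pull a line into L2, so that the HBM latency of loading block b+1 overlaps with the arithmetic on block b.

    #include <cuda_runtime.h>

    // Hypothetical illustration, not the paper's kernel: issue
    // prefetch.global.L2 hints for the next KV Cache block while the
    // current block is being computed, so later loads hit in L2
    // instead of stalling on HBM.
    __device__ __forceinline__ void prefetch_l2(const void* ptr) {
        // prefetch.global.L2 requests that the cache line containing
        // ptr be brought into the L2 cache.
        asm volatile("prefetch.global.L2 [%0];" :: "l"(ptr));
    }

    // Decode-phase attention loop with a one-block lookahead.
    // kv: KV Cache stored as contiguous tiles of block_floats floats.
    __global__ void attention_with_prefetch(const float* kv,
                                            int num_blocks,
                                            size_t block_floats) {
        const size_t line = 128 / sizeof(float);  // 128 B cache-line stride
        for (int b = 0; b < num_blocks; ++b) {
            // Overlap: while block b is consumed below, prefetch block b+1.
            if (b + 1 < num_blocks) {
                const float* next = kv + (size_t)(b + 1) * block_floats;
                for (size_t i = threadIdx.x * line; i < block_floats;
                     i += blockDim.x * line)
                    prefetch_l2(next + i);
            }
            // ... compute attention for block b here; its KV loads
            // were prefetched on the previous iteration and should
            // now hit in L2 ...
        }
    }

The lookahead depth is the key tuning knob: prefetching too far ahead risks evicting lines before they are used, while prefetching too late leaves HBM latency exposed.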

Published

2026-03-14

How to Cite

Dong, Y., Miao, Y., Li, W., Zheng, X., Wang, C., Wu, J., & Lyu, F. (2026). Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching. Proceedings of the AAAI Conference on Artificial Intelligence, 40(25), 20844–20851. https://doi.org/10.1609/aaai.v40i25.39224

Section

AAAI Technical Track on Machine Learning II