Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving

Authors

  • Hui Zeng (Xidian University; Zhongguancun Academy)
  • Daming Zhao (Tsinghua University)
  • Pengfei Yang (Xidian University)
  • WenXuan Hou (Xidian University)
  • Tianyang Zheng (Xidian University)
  • Hui Li (Xidian University)
  • Weiye Ji (Xidian University)
  • Jidong Zhai (Tsinghua University)

DOI:

https://doi.org/10.1609/aaai.v40i33.40036

Abstract

Generative reasoning with large language models (LLMs) often involves long decoding sequences, leading to substantial memory and latency overheads from accumulating key-value (KV) caches. While existing KV compression methods primarily focus on reducing prefill memory from long input sequences, they fall short in addressing the dynamic and layer-sensitive nature of long-form generation, which is central to reasoning tasks. We propose Lethe, a dynamic KV cache management framework that introduces adaptivity along both the spatial and temporal dimensions of decoding. Along the spatial dimension, Lethe performs layerwise sparsity-aware allocation, assigning token pruning budgets to each transformer layer based on estimated attention redundancy. Along the temporal dimension, Lethe conducts multi-round token pruning during generation, driven by a Recency-Aware Selective Retention (RASR) mechanism. RASR extends traditional recency-based heuristics by also considering token relevance derived from evolving attention patterns, enabling informed decisions about which tokens to retain or evict. Empirical results demonstrate that Lethe achieves a favorable balance between efficiency and generation quality across diverse models and tasks, and increases throughput by up to 2.56×.
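To make the RASR idea concrete, the following is a minimal illustrative sketch (not the paper's implementation) of how a per-layer retention step might combine a recency heuristic with attention-derived relevance. The function name `rasr_retain`, the blending weight `alpha`, and the exact scoring formula are all assumptions made for illustration; the paper only states that RASR extends recency-based heuristics with relevance from evolving attention patterns.

```python
import numpy as np

def rasr_retain(attn_scores, ages, budget, recency_window, alpha=0.5):
    """Illustrative Recency-Aware Selective Retention step (hypothetical).

    attn_scores    : per-token cumulative attention mass (relevance proxy)
    ages           : decoding steps elapsed since each token was generated
    budget         : number of KV entries this layer may keep (spatial budget)
    recency_window : tokens younger than this are always retained
    alpha          : assumed blend weight between relevance and recency
    Returns sorted indices of tokens to retain.
    """
    n = len(attn_scores)
    # Recency heuristic: never evict the most recently generated tokens.
    always_keep = np.where(ages < recency_window)[0]
    if len(always_keep) >= budget:
        return always_keep[:budget]
    # Score the remaining (older) tokens by a blend of normalized
    # attention relevance and a decaying recency term.
    rest = np.setdiff1d(np.arange(n), always_keep)
    rel = attn_scores[rest] / (attn_scores[rest].max() + 1e-9)
    rec = 1.0 / (1.0 + ages[rest])
    score = alpha * rel + (1.0 - alpha) * rec
    # Fill the remaining budget with the highest-scoring older tokens.
    k = budget - len(always_keep)
    top = rest[np.argsort(score)[-k:]]
    return np.sort(np.concatenate([always_keep, top]))
```

Under this sketch, an old token that still receives heavy attention (e.g. a key premise in a reasoning chain) survives pruning, while stale low-attention tokens are evicted first; a purely recency-based policy would treat both the same.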

Published

2026-03-14

How to Cite

Zeng, H., Zhao, D., Yang, P., Hou, W., Zheng, T., Li, H., … Zhai, J. (2026). Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving. Proceedings of the AAAI Conference on Artificial Intelligence, 40(33), 28103–28112. https://doi.org/10.1609/aaai.v40i33.40036

Section

AAAI Technical Track on Machine Learning X