Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving
DOI: https://doi.org/10.1609/aaai.v40i33.40036

Abstract
Generative reasoning with large language models (LLMs) often involves long decoding sequences, leading to substantial memory and latency overheads from accumulating key-value (KV) caches. While existing KV compression methods primarily focus on reducing prefill memory from long input sequences, they fall short in addressing the dynamic and layer-sensitive nature of long-form generation, which is central to reasoning tasks. We propose Lethe, a dynamic KV cache management framework that introduces adaptivity along both the spatial and temporal dimensions of decoding. Along the spatial dimension, Lethe performs layerwise sparsity-aware allocation, assigning token pruning budgets to each transformer layer based on estimated attention redundancy. Along the temporal dimension, Lethe conducts multi-round token pruning during generation, driven by a Recency-Aware Selective Retention (RASR) mechanism. RASR extends traditional recency-based heuristics by also considering token relevance derived from evolving attention patterns, enabling informed decisions about which tokens to retain or evict. Empirical results demonstrate that Lethe achieves a favorable balance between efficiency and generation quality across diverse models and tasks, increasing throughput by up to 2.56×.
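The abstract describes two mechanisms: a spatial one (layerwise pruning budgets driven by estimated attention redundancy) and a temporal one (RASR, which scores cached tokens by combining recency with attention-derived relevance). The paper's exact formulas are not reproduced on this page; the sketch below is a minimal NumPy illustration of the general idea only. All function names, the exponential recency decay, the linear score combination, and the redundancy-to-budget rule are assumptions for illustration, not Lethe's actual method.

```python
import numpy as np

def layer_budgets(redundancy, total_budget):
    """Spatial sketch (assumed allocation rule): give layers with LESS
    attention redundancy a LARGER share of the total token budget.

    redundancy: per-layer redundancy estimates in [0, 1].
    """
    keep_weight = 1.0 - np.asarray(redundancy, dtype=float)
    shares = keep_weight / (keep_weight.sum() + 1e-8)
    return np.maximum(1, (shares * total_budget).astype(int))

def rasr_scores(attn, decay=0.95, alpha=0.5):
    """Temporal sketch (assumed scoring): blend attention-derived
    relevance with an exponential recency prior.

    attn: (num_recent_queries, num_cached_tokens) attention weights
          accumulated over recent decoding steps.
    """
    n_tokens = attn.shape[1]
    # Relevance: mean attention mass each cached token has received.
    relevance = attn.mean(axis=0)
    # Recency: newest token (last position) scores 1.0, older tokens decay.
    recency = decay ** np.arange(n_tokens - 1, -1, -1)
    return alpha * relevance + (1.0 - alpha) * recency

def prune_kv(keys, values, attn, budget):
    """One pruning round: keep the `budget` highest-scoring tokens."""
    scores = rasr_scores(attn)
    keep = np.sort(np.argsort(scores)[-budget:])  # preserve token order
    return keys[keep], values[keep]
```

In a multi-round setting, `prune_kv` would be invoked periodically per layer during decoding, with each layer's `budget` drawn from `layer_budgets`; the actual pruning schedule and redundancy estimator are described in the paper itself.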
Published
2026-03-14
How to Cite
Zeng, H., Zhao, D., Yang, P., Hou, W., Zheng, T., Li, H., … Zhai, J. (2026). Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving. Proceedings of the AAAI Conference on Artificial Intelligence, 40(33), 28103–28112. https://doi.org/10.1609/aaai.v40i33.40036
Section
AAAI Technical Track on Machine Learning X