DAVID: Dual-stage Adaptive Vision-text Integrated Decoupling for Multimodal KV Cache Eviction

Authors

  • Yifeng Gu South China University of Technology
  • Jianxiu Jin South China University of Technology
  • Kailing Guo South China University of Technology
  • Xiangmin Xu Foshan University; South China University of Technology

DOI:

https://doi.org/10.1609/aaai.v40i26.39285

Abstract

With the rapid development of multimodal large language models (MLLMs), deploying them on low-resource devices remains challenging. Beyond the model size itself, long multimodal inputs cause substantial memory overhead in the KV cache, making efficient cache management critical. In this paper, we propose DAVID, a KV cache eviction strategy that adapts to the degree of modality fusion across layers. By analyzing the feature distributions of vision and text tokens, we observe low fusion in early layers and high fusion in deeper layers. Based on this observation, DAVID adopts a decoupled eviction strategy in shallow layers and a super-modal eviction strategy in deeper layers. To support this dynamic switching, we design a lightweight metric that quantifies cross-modal fusion, with a threshold determining which layers require decoupling. Experimental results show that DAVID achieves state-of-the-art performance on multiple benchmarks and offers a new perspective on KV cache eviction for MLLMs.
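The layer-adaptive switching described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the fusion metric is modeled as cosine similarity between mean vision-key and text-key vectors, importance scores stand in for accumulated attention, and the names `fusion_score`, `evict`, and `threshold` are hypothetical, not the paper's actual definitions.

```python
import numpy as np

def fusion_score(vision_keys, text_keys):
    # Hypothetical per-layer fusion metric: cosine similarity between
    # the mean vision-key and mean text-key vectors. The paper's actual
    # metric may differ; this only illustrates a cheap fusion measure.
    v = vision_keys.mean(axis=0)
    t = text_keys.mean(axis=0)
    return float(v @ t / (np.linalg.norm(v) * np.linalg.norm(t) + 1e-8))

def evict(scores, is_vision, budget, fusion, threshold=0.5):
    """Return indices of tokens to KEEP under a cache budget.

    scores    : per-token importance (e.g., accumulated attention)
    is_vision : boolean mask marking vision tokens
    fusion    : per-layer fusion score from `fusion_score`
    """
    n = len(scores)
    if budget >= n:
        return np.arange(n)
    if fusion < threshold:
        # Low fusion (shallow layers): decoupled eviction -- keep top
        # tokens within each modality separately, splitting the budget
        # proportionally to each modality's token count.
        keep = []
        for mask in (is_vision, ~is_vision):
            idx = np.where(mask)[0]
            k = max(1, round(budget * len(idx) / n))
            keep.extend(idx[np.argsort(scores[idx])[-k:]])
        return np.array(sorted(keep))
    # High fusion (deep layers): super-modal eviction -- rank all
    # tokens jointly, regardless of modality.
    return np.sort(np.argsort(scores)[-budget:])
```

The design point the sketch captures is that a single scalar per layer, compared against a threshold, selects between two eviction policies, so no per-token modality reasoning is needed once a layer is classified as highly fused.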

Published

2026-03-14

How to Cite

Gu, Y., Jin, J., Guo, K., & Xu, X. (2026). DAVID: Dual-stage Adaptive Vision-text Integrated Decoupling for Multimodal KV Cache Eviction. Proceedings of the AAAI Conference on Artificial Intelligence, 40(26), 21387–21395. https://doi.org/10.1609/aaai.v40i26.39285

Section

AAAI Technical Track on Machine Learning III