Sharp Eyes and Memory for VideoLLMs: Information-Aware Visual Token Pruning for Efficient and Reliable VideoLLM Reasoning

Jialong Qin; Xin Zou; Di Lu; Yibo Yan; Xuming Hu

doi:10.1609/aaai.v40i10.37805

Authors

Jialong Qin: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; Beijing Institute of Technology
Xin Zou: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology
Di Lu: The Hong Kong University of Science and Technology (Guangzhou)
Yibo Yan: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology
Xuming Hu: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology

Abstract

Current Video Large Language Models (VideoLLMs) suffer from quadratic computational complexity and a growing key-value (KV) cache because they process an excessive number of redundant visual tokens. To address this problem, we propose SharpV, a minimalist and efficient method for adaptive pruning of visual tokens and the KV cache. Unlike most uniform compression approaches, SharpV dynamically adjusts pruning ratios based on spatial-temporal information. Remarkably, this adaptive mechanism occasionally achieves performance gains over dense models, offering a novel paradigm for adaptive pruning. During the KV cache pruning stage, based on observations of visual information degradation, SharpV prunes degraded visual features in a self-calibrating manner, guided by their similarity to the original visual features. In this way, SharpV achieves hierarchical cache pruning from the perspective of the information bottleneck, offering new insight into the information flow of VideoLLMs. Experiments on multiple public benchmarks demonstrate the superiority of SharpV. Moreover, to the best of our knowledge, SharpV is the first two-stage pruning framework that operates without access to exposed attention scores, ensuring full compatibility with hardware acceleration techniques such as Flash Attention.
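
The abstract states only that degraded visual features are pruned under guidance of their similarity to the original visual features, without using attention scores. The following is a minimal illustrative sketch of that general idea and not the authors' implementation: the function name `keep_indices_by_similarity`, the use of cosine similarity, the fixed `keep_ratio` parameter, and the top-k selection rule are all assumptions made for illustration.

```python
import numpy as np


def keep_indices_by_similarity(original, current, keep_ratio):
    """Return sorted indices of visual tokens to keep (hypothetical helper).

    original: (num_tokens, dim) visual features before entering the LLM.
    current:  (num_tokens, dim) features of the same tokens at a later layer.
    keep_ratio: fraction of tokens to retain, in (0, 1] (assumed fixed here).
    """
    if original.shape != current.shape:
        raise ValueError("original and current must have the same shape")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    eps = 1e-8  # avoids division by zero for all-zero feature vectors
    a = original / (np.linalg.norm(original, axis=1, keepdims=True) + eps)
    b = current / (np.linalg.norm(current, axis=1, keepdims=True) + eps)

    # Per-token cosine similarity; no attention weights are needed.
    sim = np.sum(a * b, axis=1)

    # Assumption: tokens least similar to their original features are
    # treated as "degraded" and dropped; the most similar ones are kept.
    k = max(1, int(round(keep_ratio * sim.shape[0])))
    top = np.argsort(-sim, kind="stable")[:k]
    return np.sort(top)


# Toy example: tokens 1 and 3 have drifted away from their original features.
original = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]])
current = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]])
kept = keep_indices_by_similarity(original, current, keep_ratio=0.5)
```

Because the selection in this sketch depends only on feature vectors, it would not require an attention kernel to expose its score matrix, which is the property the abstract associates with Flash Attention compatibility. How SharpV actually sets per-layer or per-frame ratios is not specified in the abstract and is not modeled here.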
