APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval

Authors

  • Hong Gao, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China; ZTE Corporation, Nanjing 210012, China
  • Yiming Bao, ZTE Corporation, Nanjing 210012, China
  • Xuezhen Tu, ZTE Corporation, Nanjing 210012, China
  • Bin Zhong, ZTE Corporation, Nanjing 210012, China
  • Linan Yue, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
  • Min-Ling Zhang, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China

DOI:

https://doi.org/10.1609/aaai.v40i6.42406

Abstract

Current multimodal large language models (MLLMs) struggle with hour-level video understanding, facing significant challenges not only in modeling the substantial information volume of long videos but also in overcoming the memory wall and resource constraints during both training and inference. Although recent training-free approaches have alleviated resource demands by compressing visual features, their reliance on incomplete visual information limits their performance potential. To address these limitations, we propose Adaptive Pivot Visual information Retrieval (APVR), a training-free framework that hierarchically retrieves and retains sufficient and important visual information. It breaks through the memory wall via two complementary components: Pivot Frame Retrieval employs query expansion and iterative spatio-semantic confidence scoring to identify relevant video frames, and Pivot Token Retrieval performs query-aware, attention-driven token selection within up to 1024 pivot frames. This dual-granularity approach enables the processing of hour-long videos while maintaining semantic fidelity. Experimental validation on three different baseline MLLMs demonstrates significant performance improvements of up to 9.5%, 4.6% and 9.7% on LongVideoBench, VideoMME and MLVU, respectively. APVR achieves state-of-the-art results among both training-free and training-based approaches.
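
The two-stage idea in the abstract (first keep a bounded set of pivot frames, then keep a subset of tokens inside them) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract's query expansion and iterative spatio-semantic confidence scoring are replaced here by a plain maximum dot-product score over a list of query embeddings, and the attention-driven token selection by a single softmax over scaled dot products. The function names, the `keep_ratio` parameter, and the use of plain Python lists as embeddings are all assumptions made for illustration; only the cap of 1024 pivot frames comes from the abstract.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest scores, returned in original (temporal) order."""
    k = max(0, min(k, len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a non-empty sequence."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_pivot_frames(frame_embs: Sequence[Vector],
                        query_embs: Sequence[Vector],
                        max_frames: int = 1024) -> List[int]:
    """Stage 1 (simplified stand-in): score each frame by its best match
    against any of the (hypothetically expanded) query embeddings and keep
    at most max_frames frames."""
    scores = [max(dot(f, q) for q in query_embs) for f in frame_embs]
    return top_k_indices(scores, max_frames)


def select_pivot_tokens(token_embs: Sequence[Vector],
                        query_emb: Vector,
                        keep_ratio: float = 0.5) -> List[int]:
    """Stage 2 (simplified stand-in): weight the visual tokens of one frame
    by softmax attention to the query and keep the top keep_ratio fraction.
    keep_ratio is a hypothetical knob, not a value from the paper."""
    if not token_embs:
        return []
    scale = math.sqrt(len(query_emb))
    attn = softmax([dot(t, query_emb) / scale for t in token_embs])
    k = max(1, math.ceil(keep_ratio * len(token_embs)))
    return top_k_indices(attn, k)
```

In this sketch both stages return indices in their original order, so the temporal order of frames and the spatial order of tokens are preserved when the retained features are passed on to a model.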

Published

2026-03-14

How to Cite

Gao, H., Bao, Y., Tu, X., Zhong, B., Yue, L., & Zhang, M.-L. (2026). APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 40(6), 4113–4121. https://doi.org/10.1609/aaai.v40i6.42406

Section

AAAI Technical Track on Computer Vision III