QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension

Yongdong Luo; Wang Chen; Weizhong Huang; Shukang Yin; Haojia Lin; Jinfa Huang; Chaoyou Fu; Jiayi Ji; Xiawu Zheng; Jiebo Luo

doi:10.1609/aaai.v40i29.39595

Authors

Yongdong Luo, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.
Wang Chen, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.
Weizhong Huang, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.
Shukang Yin, Independent Researcher
Haojia Lin, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.
Jinfa Huang, University of Rochester
Chaoyou Fu, Nanjing University
Jiayi Ji, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.
Xiawu Zheng, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.
Jiebo Luo, University of Rochester

DOI:

https://doi.org/10.1609/aaai.v40i29.39595

Abstract

Recent advances in long video understanding typically mitigate visual redundancy through visual token pruning based on attention distribution. However, existing methods prune low-response tokens post hoc in decoder layers and overlook the input-level semantic correlation between visual tokens and instructions (the query). In this paper, we propose QuoTA, an ante-hoc, training-free module that extends existing large video-language models (LVLMs) with visual token assignment based on query-oriented, frame-level importance assessment. Query-oriented token selection is crucial because it aligns visual processing with task-specific requirements, optimizing token budget utilization while preserving semantically relevant content. Specifically, (i) QuoTA strategically allocates frame-level importance scores based on query relevance, enabling one-time visual token assignment before cross-modal interactions in decoder layers, (ii) we decouple the query through Chain-of-Thought reasoning to facilitate more precise LVLM-based frame importance scoring, and (iii) QuoTA offers plug-and-play functionality that extends to existing LVLMs. Extensive experimental results demonstrate that implementing QuoTA with LLaVA-Video-7B yields an average performance improvement of 3.2% across six benchmarks (including Video-MME and MLVU) while operating within the same visual token budget as the baseline.
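
The idea in point (i), assigning a fixed visual token budget across frames according to query-relevance scores, can be illustrated with a minimal sketch. This is not the paper's allocation rule, which the abstract does not specify. It assumes the per-frame scores have already been produced by some scorer and simply splits the budget proportionally, using largest-remainder rounding so the total matches the budget exactly. The function name `allocate_tokens` and the `min_tokens` parameter are hypothetical.

```python
def allocate_tokens(scores, budget, min_tokens=0):
    """Split `budget` visual tokens across frames in proportion to `scores`.

    Illustrative only: proportional allocation is an assumption here,
    not the assignment strategy described in the paper.
    """
    n = len(scores)
    if n == 0:
        return []
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    if min_tokens * n > budget:
        raise ValueError("budget too small for the per-frame minimum")

    # Reserve the per-frame minimum, then share out what is left.
    remaining = budget - min_tokens * n
    total = sum(scores)
    # If every score is zero, fall back to a uniform split.
    weights = [s / total for s in scores] if total > 0 else [1.0 / n] * n

    ideal = [w * remaining for w in weights]
    alloc = [int(x) for x in ideal]  # floor of each ideal share
    leftover = remaining - sum(alloc)

    # Largest-remainder rounding: give the leftover tokens to the frames
    # whose ideal share had the largest fractional part.
    order = sorted(range(n), key=lambda i: ideal[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1

    return [a + min_tokens for a in alloc]
```

For example, `allocate_tokens([1, 1, 2], budget=8)` gives the third frame half of the budget, while the total stays at 8 tokens, matching the abstract's constraint of operating within the same token budget as the baseline.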
