QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension

Authors

  • Yongdong Luo, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
  • Wang Chen, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
  • Weizhong Huang, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
  • Shukang Yin, Independent Researcher
  • Haojia Lin, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
  • Jinfa Huang, University of Rochester
  • Chaoyou Fu, Nanjing University
  • Jiayi Ji, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
  • Xiawu Zheng, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
  • Jiebo Luo, University of Rochester

DOI:

https://doi.org/10.1609/aaai.v40i29.39595

Abstract

Recent advances in long video understanding typically mitigate visual redundancy by pruning visual tokens according to their attention distribution. However, existing methods prune low-response tokens post hoc in decoder layers and overlook the input-level semantic correlation between visual tokens and the instruction (query). In this paper, we propose QuoTA, an ante-hoc, training-free module that extends existing large video-language models (LVLMs) by assigning visual tokens according to a query-oriented, frame-level importance assessment. Query-oriented token selection is crucial because it aligns visual processing with task-specific requirements, making better use of the token budget while preserving semantically relevant content. Specifically, (i) QuoTA allocates frame-level importance scores based on query relevance, enabling one-time visual token assignment before cross-modal interactions in decoder layers; (ii) we decouple the query through Chain-of-Thought reasoning to facilitate more precise LVLM-based frame importance scoring; and (iii) QuoTA offers plug-and-play functionality that extends to existing LVLMs. Extensive experiments show that implementing QuoTA with LLaVA-Video-7B yields an average performance improvement of 3.2% across six benchmarks (including Video-MME and MLVU) while operating within the same visual token budget as the baseline.
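
The idea of assigning a fixed visual token budget across frames according to query-oriented importance scores can be illustrated with a minimal sketch. The abstract does not specify the allocation rule, so the proportional split with largest-remainder rounding below is an assumption made for illustration, not the authors' method. The function name `allocate_tokens` and the `min_tokens` parameter are hypothetical, and the scores are assumed to come from some upstream frame-scoring step.

```python
from fractions import Fraction
from math import floor


def allocate_tokens(scores, budget, min_tokens=1):
    """Split a fixed token budget across frames in proportion to their scores.

    Illustrative assumption only: proportional allocation with
    largest-remainder rounding. `scores` are non-negative per-frame
    importance values; `budget` is the total number of visual tokens.
    """
    n = len(scores)
    if n == 0:
        return []
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    if budget < n * min_tokens:
        raise ValueError("budget too small for the per-frame minimum")

    # Tokens left to distribute after every frame receives its minimum.
    remaining = budget - n * min_tokens

    # Exact rational arithmetic avoids floating-point rounding surprises.
    exact = [Fraction(s) for s in scores]
    total = sum(exact)
    if total == 0:
        weights = [Fraction(1, n)] * n  # no signal: fall back to uniform
    else:
        weights = [s / total for s in exact]

    raw = [w * remaining for w in weights]
    alloc = [min_tokens + floor(r) for r in raw]

    # Hand out tokens lost to flooring, largest fractional part first.
    leftover = budget - sum(alloc)
    order = sorted(range(n), key=lambda i: raw[i] - floor(raw[i]), reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


if __name__ == "__main__":
    # Three frames; the second is assumed to be most relevant to the query.
    print(allocate_tokens([0.1, 0.6, 0.3], budget=100))
```

Under this assumed rule the total always equals the budget, so frames judged more relevant to the query receive more tokens without increasing the overall token count relative to a uniform baseline.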

Published

2026-03-14

How to Cite

Luo, Y., Chen, W., Huang, W., Yin, S., Lin, H., Huang, J., … Luo, J. (2026). QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension. Proceedings of the AAAI Conference on Artificial Intelligence, 40(29), 24160–24168. https://doi.org/10.1609/aaai.v40i29.39595

Section

AAAI Technical Track on Machine Learning VI