TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding

Authors

  • Canhui Tang, National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University; Institute of Artificial Intelligence (TeleAI), China Telecom
  • Zifan Han, National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University; Institute of Artificial Intelligence (TeleAI), China Telecom
  • Hongbo Sun Institute of Artificial Intelligence (TeleAI), China Telecom
  • Sanping Zhou National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
  • Xuchong Zhang National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
  • Xin Wei Institute of Artificial Intelligence (TeleAI), China Telecom
  • Ye Yuan Institute of Artificial Intelligence (TeleAI), China Telecom
  • Huayu Zhang Institute of Artificial Intelligence (TeleAI), China Telecom
  • Jinglin Xu University of Science and Technology Beijing
  • Hao Sun Institute of Artificial Intelligence (TeleAI), China Telecom

DOI:

https://doi.org/10.1609/aaai.v40i11.37896

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-language tasks, yet they still face challenges when processing long-duration video inputs. This limitation arises from MLLMs' context-length limits and training costs, which necessitate sparse frame sampling before videos are fed into MLLMs. However, building a trainable sampling method remains challenging because sparse frame sampling in Video-MLLMs is unsupervised and non-differentiable. To address these problems, we propose Temporal Sampling Policy Optimization (**TSPO**), which advances MLLMs' long-form video-language understanding via reinforcement learning. Specifically, we first propose a trainable event-aware temporal agent that captures event-query correlation to perform probabilistic keyframe selection. We then propose the TSPO reinforcement learning paradigm, which models keyframe selection and language generation as a joint decision-making process, enabling end-to-end group relative optimization of the temporal sampling policy. Furthermore, we propose a dual-style long-video training data construction pipeline that balances comprehensive temporal understanding with key segment localization. Finally, we incorporate rule-based answering-accuracy and temporal-locating reward mechanisms to optimize the temporal sampling policy. Comprehensive experiments show that TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks and transfers across different cutting-edge Video-MLLMs.
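The two core mechanisms the abstract names, probabilistic keyframe selection and group relative optimization, can be illustrated with a minimal sketch. This is not the authors' implementation: the relevance scores, the sampling-without-replacement scheme, and the GRPO-style reward normalization below are generic stand-ins for the paper's event-aware temporal agent and reward design.

```python
import math
import random

def sample_keyframes(scores, k, temperature=1.0):
    """Probabilistically select k frame indices from per-frame
    event-query relevance scores via softmax sampling without
    replacement (a generic stand-in for the temporal agent)."""
    weights = [math.exp(s / temperature) for s in scores]
    remaining = list(range(len(scores)))
    chosen = []
    for _ in range(k):
        total = sum(weights[i] for i in remaining)
        r = random.uniform(0.0, total)
        acc = 0.0
        for i in remaining:
            acc += weights[i]
            if acc >= r:
                chosen.append(i)
                remaining.remove(i)
                break
    return sorted(chosen)

def group_relative_advantages(rewards):
    """Normalize rule-based rewards within a sampled group
    (GRPO-style): advantage = (reward - group mean) / group std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]
```

In an end-to-end loop, each group member would sample a keyframe set, the frozen Video-MLLM would answer from those frames, and the resulting accuracy/locating rewards would be normalized group-relatively to update the sampling policy.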

Published

2026-03-14

How to Cite

Tang, C., Han, Z., Sun, H., Zhou, S., Zhang, X., Wei, X., … Sun, H. (2026). TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 40(11), 9368–9376. https://doi.org/10.1609/aaai.v40i11.37896

Section

AAAI Technical Track on Computer Vision VIII