TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding

Authors

  • Canhui Tang, National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University; Institute of Artificial Intelligence (TeleAI), China Telecom
  • Zifan Han, National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University; Institute of Artificial Intelligence (TeleAI), China Telecom
  • Hongbo Sun Institute of Artificial Intelligence (TeleAI), China Telecom
  • Sanping Zhou National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
  • Xuchong Zhang National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
  • Xin Wei Institute of Artificial Intelligence (TeleAI), China Telecom
  • Ye Yuan Institute of Artificial Intelligence (TeleAI), China Telecom
  • Huayu Zhang Institute of Artificial Intelligence (TeleAI), China Telecom
  • Jinglin Xu University of Science and Technology Beijing
  • Hao Sun Institute of Artificial Intelligence (TeleAI), China Telecom

DOI:

https://doi.org/10.1609/aaai.v40i11.37896

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-language tasks, yet they still face challenges when processing long-duration video inputs. This limitation arises from MLLMs' context-length limits and training costs, which necessitate sparse frame sampling before videos are fed into MLLMs. However, building a trainable sampling method remains challenging because sparse frame sampling in Video-MLLMs is unsupervised and non-differentiable. To address these problems, we propose Temporal Sampling Policy Optimization (**TSPO**), which advances MLLMs' long-form video-language understanding via reinforcement learning. Specifically, we first propose a trainable event-aware temporal agent that captures event-query correlation to perform probabilistic keyframe selection. We then propose the TSPO reinforcement learning paradigm, which models keyframe selection and language generation as a joint decision-making process, enabling end-to-end group relative optimization of the temporal sampling policy. Furthermore, we propose a dual-style long-video training data construction pipeline that balances comprehensive temporal understanding with key segment localization. Finally, we incorporate rule-based answering-accuracy and temporal-locating reward mechanisms to optimize the temporal sampling policy. Comprehensive experiments show that TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks and transfers across different cutting-edge Video-MLLMs.
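The two core mechanisms the abstract names, probabilistic keyframe selection and group relative optimization, can be illustrated with a minimal sketch. This is not the authors' implementation: the relevance scores, the sampling-without-replacement scheme, and the GRPO-style reward normalization below are generic stand-ins for the paper's event-aware temporal agent and reward design.

```python
import math
import random

def sample_keyframes(scores, k, temperature=1.0):
    """Probabilistically select k frame indices from per-frame
    event-query relevance scores via softmax sampling without
    replacement (a generic stand-in for the temporal agent)."""
    weights = [math.exp(s / temperature) for s in scores]
    remaining = list(range(len(scores)))
    chosen = []
    for _ in range(k):
        total = sum(weights[i] for i in remaining)
        r = random.uniform(0.0, total)
        acc = 0.0
        for i in remaining:
            acc += weights[i]
            if acc >= r:
                chosen.append(i)
                remaining.remove(i)
                break
    return sorted(chosen)

def group_relative_advantages(rewards):
    """Normalize rule-based rewards within a sampled group
    (GRPO-style): advantage = (reward - group mean) / group std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]
```

In an end-to-end loop, each group member would sample a keyframe set, the frozen Video-MLLM would answer from those frames, and the resulting accuracy/locating rewards would be normalized group-relatively to update the sampling policy.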

Published

2026-03-14

How to Cite

Tang, C., Han, Z., Sun, H., Zhou, S., Zhang, X., Wei, X., … Sun, H. (2026). TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 40(11), 9368–9376. https://doi.org/10.1609/aaai.v40i11.37896

Section

AAAI Technical Track on Computer Vision VIII