TIME: Temporal-Sensitive Multi-Dimensional Instruction Tuning and Robust Benchmarking for Video-LLMs

Yunxiao Wang; Meng Liu; Wenqi Liu; Xuemeng Song; Bin Wen; Fan Yang; Tingting Gao; Di Zhang; Guorui Zhou; Liqiang Nie

doi:10.1609/aaai.v40i12.38002

Authors

Yunxiao Wang (Shandong University, Jinan, China)
Meng Liu (Shandong Jianzhu University, Jinan, China)
Wenqi Liu (Shandong University, Jinan, China)
Xuemeng Song (Southern University of Science and Technology, Shenzhen, China)
Bin Wen (Kuaishou Technology, Beijing, China)
Fan Yang (Kuaishou Technology, Beijing, China)
Tingting Gao (Kuaishou Technology, Beijing, China)
Di Zhang (Kuaishou Technology, Beijing, China)
Guorui Zhou (Kuaishou Technology, Beijing, China)
Liqiang Nie (Shandong University, Jinan, China)


Abstract

Video large language models have achieved remarkable performance in tasks such as video question answering; however, their temporal understanding remains suboptimal. To address this limitation, we curate a dedicated instruction fine-tuning dataset that focuses on enhancing temporal comprehension across five key dimensions. To reduce reliance on costly temporal annotations, we introduce a multi-task prompt fine-tuning approach that integrates temporal-sensitive tasks into existing instruction datasets without requiring additional annotations. Furthermore, we develop a novel benchmark for temporal-sensitive video understanding that not only fills the gaps in dimension coverage left by existing benchmarks but also rigorously filters out potential shortcuts, ensuring a more accurate evaluation. Extensive experimental results demonstrate that our approach significantly enhances the temporal understanding of video-LLMs while avoiding reliance on shortcuts.
