Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding

Yunlong Tang; Daiki Shimada; Jing Bi; Mingqian Feng; Hang Hua; Chenliang Xu

doi:10.1609/aaai.v39i7.32784

Authors

Yunlong Tang, University of Rochester
Daiki Shimada, Sony Group Corporation
Jing Bi, University of Rochester
Mingqian Feng, University of Rochester
Hang Hua, University of Rochester
Chenliang Xu, University of Rochester

DOI: https://doi.org/10.1609/aaai.v39i7.32784

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in natural language and multimodal domains. Fine-tuning multimodal LLMs on temporal annotations from well-annotated datasets, e.g., dense video captioning datasets, gives them temporal understanding capacity in video-language tasks. However, there is a notable lack of untrimmed audio-visual video datasets with precise temporal annotations for events. This deficiency hinders LLMs from learning the alignment between time, audio-visual events, and text tokens, and thus impairs their ability to temporally localize audio-visual events in videos. To address this gap, we introduce PU-VALOR, a comprehensive audio-visual dataset comprising over 114,081 pseudo-untrimmed videos with detailed temporal annotations. PU-VALOR is derived from VALOR, a large-scale but coarsely annotated audio-visual dataset, through a subtle method involving event-based video clustering, random temporal scaling, and permutation. By fine-tuning a multimodal LLM on PU-VALOR, we developed AVicuna, a model capable of aligning audio-visual events with temporal intervals and corresponding text tokens. AVicuna excels in temporal localization and time-aware dialogue. Our experiments demonstrate that AVicuna effectively handles temporal understanding in audio-visual videos and achieves state-of-the-art performance on open-ended video QA, audio-visual QA, and audio-visual event dense localization tasks.
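
To make the idea of a pseudo-untrimmed video concrete, the following is a minimal Python sketch of how short captioned clips might be combined into one longer timeline with per-event intervals. It is an illustration under stated assumptions, not the authors' implementation: the names `Clip`, `Segment`, `build_pseudo_untrimmed`, and the `scale_range` default are hypothetical, the clips are assumed to have already been grouped by the event-based clustering step (which is not shown), and only the random temporal scaling and permutation steps are modeled.

```python
import random
from dataclasses import dataclass


@dataclass
class Clip:
    caption: str      # coarse event description for the trimmed clip
    duration: float   # clip length in seconds


@dataclass
class Segment:
    caption: str
    start: float      # start time inside the pseudo-untrimmed video
    end: float        # end time inside the pseudo-untrimmed video


def build_pseudo_untrimmed(clips, rng, scale_range=(0.5, 2.0)):
    """Shuffle clips, rescale each one in time, and lay them end to end.

    The scale range is an assumed placeholder, not a value from the paper.
    """
    order = list(clips)
    rng.shuffle(order)                      # permutation of the clip order
    segments = []
    cursor = 0.0
    for clip in order:
        scale = rng.uniform(*scale_range)   # random temporal scaling factor
        length = clip.duration * scale
        segments.append(Segment(clip.caption, cursor, cursor + length))
        cursor += length                    # next clip starts where this one ends
    return segments


if __name__ == "__main__":
    clips = [Clip("a dog barks", 10.0), Clip("a door slams", 4.0)]
    for seg in build_pseudo_untrimmed(clips, random.Random(0)):
        print(f"{seg.start:.2f}-{seg.end:.2f}s: {seg.caption}")
```

Each returned `Segment` pairs a caption with a start and end time, which is the kind of temporal annotation the abstract says untrimmed audio-visual datasets lack.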
