Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding

Authors

  • Yunlong Tang, University of Rochester
  • Daiki Shimada, Sony Group Corporation
  • Jing Bi, University of Rochester
  • Mingqian Feng, University of Rochester
  • Hang Hua, University of Rochester
  • Chenliang Xu, University of Rochester

DOI:

https://doi.org/10.1609/aaai.v39i7.32784

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in natural language and multimodal domains. Fine-tuning multimodal LLMs on temporal annotations from well-annotated datasets, e.g., dense video captioning datasets, gives them temporal understanding capacity in video-language tasks. However, there is a notable lack of untrimmed audio-visual video datasets with precise temporal annotations for events. This deficiency hinders LLMs from learning the alignment between time, audio-visual events, and text tokens, thus impairing their ability to temporally localize audio-visual events in videos. To address this gap, we introduce PU-VALOR, a comprehensive audio-visual dataset comprising over 114,081 pseudo-untrimmed videos with detailed temporal annotations. PU-VALOR is derived from the large-scale but coarsely annotated audio-visual dataset VALOR through a subtle method involving event-based video clustering, random temporal scaling, and permutation. By fine-tuning a multimodal LLM on PU-VALOR, we developed AVicuna, a model capable of aligning audio-visual events with temporal intervals and corresponding text tokens. AVicuna excels in temporal localization and time-aware dialogue. Our experiments demonstrate that AVicuna effectively handles temporal understanding in audio-visual videos and achieves state-of-the-art performance on open-ended video QA, audio-visual QA, and audio-visual event dense localization tasks.
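
The construction described in the abstract (random temporal scaling and permutation of clips, yielding a longer video with per-event time boundaries) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Clip`, `Segment`, and `build_pseudo_untrimmed`, the `scale_range` default, and the choice to scale durations uniformly at random are all assumptions made for illustration. The event-based clustering step is omitted, so the input clips are assumed to belong to one cluster already.

```python
import random
from dataclasses import dataclass


@dataclass
class Clip:
    caption: str
    duration: float  # seconds


@dataclass
class Segment:
    caption: str
    start: float  # seconds from the start of the pseudo-untrimmed video
    end: float


def build_pseudo_untrimmed(clips, scale_range=(0.5, 2.0), seed=None):
    """Scale each clip's duration, shuffle the order, and concatenate.

    Hypothetical helper: the scaling range and sampling scheme are
    illustrative assumptions, not values taken from the paper.
    """
    rng = random.Random(seed)
    # Random temporal scaling: stretch or compress each clip independently.
    scaled = [Clip(c.caption, c.duration * rng.uniform(*scale_range)) for c in clips]
    # Permutation: randomize the order in which clips are concatenated.
    rng.shuffle(scaled)
    # Concatenation: running offsets become the temporal annotations.
    segments, t = [], 0.0
    for c in scaled:
        segments.append(Segment(c.caption, t, t + c.duration))
        t += c.duration
    return segments
```

Calling `build_pseudo_untrimmed([Clip("dog barks", 10.0), Clip("door slams", 4.0)], seed=0)` returns one `Segment` per input clip, with contiguous start and end times that serve as the temporal annotation for each caption.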

Published

2025-04-11

How to Cite

Tang, Y., Shimada, D., Bi, J., Feng, M., Hua, H., & Xu, C. (2025). Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 39(7), 7293–7301. https://doi.org/10.1609/aaai.v39i7.32784

Section

AAAI Technical Track on Computer Vision VI