LiViBench: An Omnimodal Benchmark for Interactive Livestream Video Understanding

Authors

  • Xiaodong Wang, School of Electronic and Computer Engineering, Peking University
  • Langling Huang, School of Electronic and Computer Engineering, Peking University
  • Zhirong Wu, School of Electronic and Computer Engineering, Peking University
  • Xu Zhao, Douyin Group
  • Teng Xu, Douyin Group
  • Xuhong Xia, Douyin Group
  • Peixi Peng, School of Electronic and Computer Engineering, Peking University

DOI:

https://doi.org/10.1609/aaai.v40i31.39859

Abstract

The development of multimodal large language models (MLLMs) has advanced general video understanding. However, existing video evaluation benchmarks primarily focus on non-interactive videos, such as movies and recordings. To address this gap, this paper proposes the first omnimodal benchmark for interactive livestream videos, LiViBench. It features a diverse set of 24 tasks spanning perceptual, reasoning, and livestream-specific challenges. To construct the dataset efficiently, we design a standardized semi-automatic annotation workflow that incorporates humans in the loop at multiple stages. The workflow leverages multiple MLLMs as a multi-agent system for comprehensive video description and uses a seed-question-driven method to construct high-quality annotations. All interactive videos in the benchmark include audio, speech, and real-time comment modalities. To enhance models' understanding of interactive videos, we design a tailored two-stage instruction-tuning scheme and propose a Video-to-Comment Retrieval (VCR) module to improve the model's ability to utilize real-time comments. Based on these advancements, we develop LiVi-LLM-7B, an MLLM with enhanced knowledge of interactive livestreams. Experiments show that our model outperforms larger open-source models with up to 72B parameters, narrows the gap with leading proprietary models on LiViBench, and achieves enhanced performance on general video benchmarks, including VideoMME, LongVideoBench, MLVU, and VideoEval-Pro.

Published

2026-03-14

How to Cite

Wang, X., Huang, L., Wu, Z., Zhao, X., Xu, T., Xia, X., & Peng, P. (2026). LiViBench: An Omnimodal Benchmark for Interactive Livestream Video Understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 40(31), 26517–26525. https://doi.org/10.1609/aaai.v40i31.39859

Section

AAAI Technical Track on Machine Learning VIII