Benchmarking Trustworthiness in Multimodal LLMs for Video Understanding

Youze Wang; Zijun Chen; Ruoyu Chen; Shishen Gu; Wenbo Hu; Jiayang Liu; Yinpeng Dong; Hang Su; Jun Zhu; Meng Wang; Richang Hong

doi:10.1609/aaai.v40i44.41135

Authors

Youze Wang, Hefei University of Technology
Zijun Chen, Hefei University of Technology
Ruoyu Chen, Hefei University of Technology
Shishen Gu, Hefei University of Technology
Wenbo Hu, Hefei University of Technology
Jiayang Liu, Institute of Science Tokyo
Yinpeng Dong, Tsinghua University
Hang Su, Tsinghua University
Jun Zhu, Tsinghua University
Meng Wang, Hefei University of Technology
Richang Hong, Hefei University of Technology

DOI: https://doi.org/10.1609/aaai.v40i44.41135

Abstract

Recent advancements in multimodal large language models for video understanding (videoLLMs) have enhanced their capacity to process complex spatiotemporal data. However, challenges such as factual inaccuracies, harmful content, biases, hallucinations, and privacy risks compromise their reliability. This study introduces Trust-videoLLMs, the first comprehensive benchmark evaluating 23 state-of-the-art videoLLMs (5 commercial, 18 open-source) across five critical dimensions: truthfulness, robustness, safety, fairness, and privacy. Comprising 30 tasks with adapted, synthetic, and annotated videos, the framework assesses spatiotemporal risks, temporal consistency, and cross-modal impact. Results reveal significant limitations in dynamic scene comprehension, cross-modal perturbation resilience, and real-world risk mitigation. Open-source models occasionally outperform proprietary ones, but proprietary models generally exhibit superior credibility, and scaling does not consistently improve performance. These findings underscore the need for enhanced training data diversity and robust multimodal alignment. Trust-videoLLMs provides a publicly available, extensible toolkit for standardized trustworthiness assessments, addressing the critical gap between accuracy-focused benchmarks and demands for robustness, safety, fairness, and privacy.
