Benchmarking Trustworthiness in Multimodal LLMs for Video Understanding

Authors

  • Youze Wang, Hefei University of Technology
  • Zijun Chen, Hefei University of Technology
  • Ruoyu Chen, Hefei University of Technology
  • Shishen Gu, Hefei University of Technology
  • Wenbo Hu, Hefei University of Technology
  • Jiayang Liu, Institute of Science Tokyo
  • Yinpeng Dong, Tsinghua University
  • Hang Su, Tsinghua University
  • Jun Zhu, Tsinghua University
  • Meng Wang, Hefei University of Technology
  • Richang Hong, Hefei University of Technology

DOI:

https://doi.org/10.1609/aaai.v40i44.41135

Abstract

Recent advancements in multimodal large language models for video understanding (videoLLMs) have enhanced their capacity to process complex spatiotemporal data. However, challenges such as factual inaccuracies, harmful content, biases, hallucinations, and privacy risks compromise their reliability. This study introduces Trust-videoLLMs, the first comprehensive benchmark evaluating 23 state-of-the-art videoLLMs (5 commercial, 18 open-source) across five critical dimensions: truthfulness, robustness, safety, fairness, and privacy. Comprising 30 tasks with adapted, synthetic, and annotated videos, the framework assesses spatiotemporal risks, temporal consistency, and cross-modal impact. Results reveal significant limitations in dynamic scene comprehension, cross-modal perturbation resilience, and real-world risk mitigation. Although open-source models occasionally outperform their proprietary counterparts, proprietary models generally exhibit superior credibility, and scaling does not consistently improve performance. These findings underscore the need for enhanced training data diversity and robust multimodal alignment. Trust-videoLLMs provides a publicly available, extensible toolkit for standardized trustworthiness assessments, addressing the critical gap between accuracy-focused benchmarks and demands for robustness, safety, fairness, and privacy.

Published

2026-03-14

How to Cite

Wang, Y., Chen, Z., Chen, R., Gu, S., Hu, W., Liu, J., … Hong, R. (2026). Benchmarking Trustworthiness in Multimodal LLMs for Video Understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 40(44), 37979–37987. https://doi.org/10.1609/aaai.v40i44.41135

Section

AAAI Special Track on AI Alignment