Benchmarking Trustworthiness in Multimodal LLMs for Video Understanding
DOI:
https://doi.org/10.1609/aaai.v40i44.41135
Abstract
Recent advancements in multimodal large language models for video understanding (videoLLMs) have enhanced their capacity to process complex spatiotemporal data. However, challenges such as factual inaccuracies, harmful content, biases, hallucinations, and privacy risks compromise their reliability. This study introduces Trust-videoLLMs, a first comprehensive benchmark evaluating 23 state-of-the-art videoLLMs (5 commercial, 18 open-source) across five critical dimensions: truthfulness, robustness, safety, fairness, and privacy. Comprising 30 tasks with adapted, synthetic, and annotated videos, the framework assesses spatiotemporal risks, temporal consistency, and cross-modal impact. Results reveal significant limitations in dynamic scene comprehension, cross-modal perturbation resilience, and real-world risk mitigation. While open-source models occasionally outperform proprietary ones, proprietary models generally exhibit superior credibility, and scaling does not consistently improve performance. These findings underscore the need for enhanced training data diversity and robust multimodal alignment. Trust-videoLLMs provides a publicly available, extensible toolkit for standardized trustworthiness assessments, addressing the critical gap between accuracy-focused benchmarks and demands for robustness, safety, fairness, and privacy.
Published
2026-03-14
How to Cite
Wang, Y., Chen, Z., Chen, R., Gu, S., Hu, W., Liu, J., … Hong, R. (2026). Benchmarking Trustworthiness in Multimodal LLMs for Video Understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 40(44), 37979–37987. https://doi.org/10.1609/aaai.v40i44.41135
Section
AAAI Special Track on AI Alignment