Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input
DOI:
https://doi.org/10.1609/aaai.v40i37.40420
Abstract
Multimodal Large Language Models (MLLMs) increasingly support dynamic image resolutions. However, current evaluation paradigms primarily assess semantic performance, overlooking the critical question of resolution robustness: whether performance remains stable across varying input resolutions. To address this gap, we introduce Res-Bench, a comprehensive benchmark comprising 14,400 samples across 12 resolution levels and six core capability dimensions. We designed a novel evaluation framework that goes beyond traditional accuracy metrics to capture performance stability. This framework introduces multiple robustness metrics: Spearman's correlation for assessing resolution-performance trends, and Absolute/Relative Continuous Error (ACE/RCE) for measuring performance volatility. Using these metrics, we conducted a large-scale evaluation of leading MLLMs. Our analysis encompasses: (1) model-centric and task-centric robustness examination, (2) investigation of preprocessing strategies including padding and super-resolution, and (3) exploration of fine-tuning for stability enhancement.
Published
2026-03-14
How to Cite
Li, C., Wang, Z., Sheng, Y., Zhu, X., Hao, Y., & Wang, X. (2026). Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37), 31545–31553. https://doi.org/10.1609/aaai.v40i37.40420
Issue
Vol. 40 No. 37 (2026)
Section
AAAI Technical Track on Natural Language Processing II
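Illustrative note on the trend metric
The abstract names Spearman's correlation as the measure of the resolution-performance trend. The sketch below shows one way such a coefficient could be computed from per-resolution accuracies, using only the Python standard library. The resolution values, the accuracy numbers, and the function names are hypothetical placeholders, not data or code from the paper, and the benchmark itself uses 12 resolution levels rather than the 6 shown here. ACE and RCE are not defined in the abstract, so they are not reproduced.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values equal to the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical example: accuracy of one model at six input resolutions.
resolutions = [224, 336, 448, 560, 672, 784]
accuracy = [0.41, 0.48, 0.55, 0.54, 0.60, 0.63]

rho = spearman(resolutions, accuracy)
print(f"Spearman rho between resolution and accuracy: {rho:.3f}")
```

A value of rho near 1 would indicate that accuracy rises almost monotonically with resolution, a value near -1 that it falls, and a value near 0 that there is no consistent monotonic trend. Rank-based correlation is used here because it captures monotonic trends without assuming that accuracy changes linearly with resolution.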