Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs
DOI:
https://doi.org/10.1609/aaai.v40i14.38188
Abstract
Multi-view understanding, the ability to reconcile visual information across diverse viewpoints for effective navigation, manipulation, and 3D scene comprehension, is a fundamental challenge for Multi-Modal Large Language Models (MLLMs) deployed as embodied agents. While recent MLLMs have shown impressive advances in high-level reasoning and planning, they frequently fall short when confronted with multi-view geometric consistency and cross-view correspondence. To comprehensively evaluate the challenges of MLLMs in multi-view scene reasoning, we introduce All-Angles Bench, a carefully human-annotated benchmark with over 2,100 question-answer pairs from 90 diverse, real-world scenes. Our broad evaluation across 38 general-purpose and 3D spatial reasoning MLLMs reveals a substantial performance gap compared to humans. More critically, our analysis identifies two root failure modes: (1) cross-view object mismatch, the inability to establish consistent object correspondence across views; and (2) cross-view spatial misalignment, the failure to infer accurate camera poses and spatial layouts. These findings underscore a lack of multi-view awareness in current MLLMs, calling for architectural innovations beyond prompt tuning alone. We believe that our benchmark offers valuable insights toward building spatially-intelligent MLLMs.
Published
2026-03-14
How to Cite
Yeh, C.-H., Wang, C., Tong, S., Cheng, T.-Y., Wang, R., Chu, T., Zhai, Y., Chen, Y., Gao, S., & Ma, Y. (2026). Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(14), 12000-12008. https://doi.org/10.1609/aaai.v40i14.38188
Issue
Section
AAAI Technical Track on Computer Vision XI