Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs
DOI:
https://doi.org/10.1609/aaai.v40i14.38188
Abstract
Multi-view understanding, the ability to reconcile visual information across diverse viewpoints for effective navigation, manipulation, and 3D scene comprehension, is a fundamental challenge for Multi-Modal Large Language Models (MLLMs) deployed as embodied agents. While recent MLLMs have shown impressive advances in high-level reasoning and planning, they frequently fall short when confronted with multi-view geometric consistency and cross-view correspondence. To comprehensively evaluate the challenges of MLLMs in multi-view scene reasoning, we introduce All-Angles Bench, a carefully human-annotated benchmark with over 2,100 question-answer pairs from 90 diverse, real-world scenes. Our broad evaluation across 38 general-purpose and 3D spatial reasoning MLLMs reveals a substantial performance gap compared to humans. More critically, our analysis identifies two root failure modes: (1) cross-view object mismatch, the inability to establish consistent object correspondence across views; and (2) cross-view spatial misalignment, the failure to infer accurate camera poses and spatial layouts. These findings underscore a lack of multi-view awareness in current MLLMs, calling for architectural innovations beyond prompt tuning alone. We believe that our benchmark offers valuable insights toward building spatially-intelligent MLLMs.
Published
2026-03-14
How to Cite
Yeh, C.-H., Wang, C., Tong, S., Cheng, T.-Y., Wang, R., Chu, T., Zhai, Y., Chen, Y., Gao, S., & Ma, Y. (2026). Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(14), 12000-12008. https://doi.org/10.1609/aaai.v40i14.38188
Issue
Section
AAAI Technical Track on Computer Vision XI