Aligning Cross-View Visual Geometries in LVLMs Through Human-Like Reasoning Learning
DOI:
https://doi.org/10.1609/aaai.v40i23.39011
Abstract
Spatial understanding is a critical capability for LVLMs (Large Vision-Language Models) to advance embodied AI applications. Existing works primarily focus on enhancing spatial understanding within a single frame, i.e., injecting 3D spatial concepts into LVLMs under a single coordinate system. However, such improvements struggle in real-world tasks that require consistent cross-view spatial reasoning. In this paper, we propose CVVG-Reasoner (Cross-View Visual Geometries), which lifts single-frame spatial comprehension to unified cross-view spatial understanding by mimicking human-like cross-view reasoning mechanisms. First, we introduce MV3DSR (Multi-View 3D Spatial Reasoning), a scalable pipeline for cross-view spatial reasoning data generation, and construct MV3DSR-Dataset, a large-scale dataset with diverse 3D cross-view reasoning tasks. Based on MV3DSR, we propose MV3DSR-Bench, a comprehensive benchmark for evaluating cross-view spatial reasoning capabilities. Second, we design a three-stage training strategy: the first two stages progressively equip the model with (1) fundamental spatial knowledge and (2) human-like cross-view reasoning patterns, while the final stage employs reinforcement learning to further boost its performance. Extensive experiments demonstrate that CVVG-Reasoner significantly outperforms existing 3D LLMs (Large Language Models) and advanced LVLMs in cross-view tasks while maintaining robust performance on out-of-domain data. Ablations further reveal that injecting human-like reasoning patterns yields a 44% performance gain, validating the effectiveness of our design.
Published
2026-03-14
How to Cite
Qiao, Y., Luo, L., Meng, D., Yang, Y., Wang, Q., Wang, J., … Zhang, X. (2026). Aligning Cross-View Visual Geometries in LVLMs Through Human-Like Reasoning Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(23), 19345–19353. https://doi.org/10.1609/aaai.v40i23.39011
Section
AAAI Technical Track on Knowledge Representation and Reasoning