GeoX-Bench: Benchmarking Cross-View Geo-Localization and Pose Estimation Capabilities of Large Multimodal Models

Authors

  • Yushuo Zheng, Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory
  • Jiangyong Ying, China Telecom
  • Huiyu Duan, Shanghai Jiao Tong University
  • Chunyi Li, Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory
  • Zicheng Zhang, Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory
  • Jing Liu, Tianjin University
  • Xiaohong Liu, Shanghai Jiao Tong University; Shanghai Jiao Tong University Sichuan Research Institute
  • Guangtao Zhai, Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory

DOI

https://doi.org/10.1609/aaai.v40i16.38353

Abstract

Large multimodal models (LMMs) have demonstrated remarkable capabilities across a wide range of tasks; however, their knowledge and abilities in cross-view geo-localization and pose estimation remain unexplored, despite potential benefits for navigation, autonomous driving, outdoor robotics, and related applications. To bridge this gap, we introduce GeoX-Bench, a comprehensive benchmark designed to explore and evaluate the capabilities of LMMs in cross-view geo-localization and pose estimation. Specifically, GeoX-Bench contains 10,859 panoramic-satellite image pairs spanning 128 cities in 49 countries, along with 755,976 corresponding question-answering (QA) pairs. Among these, 42,900 QA pairs are designated for benchmarking, while the remaining pairs are intended to enhance the capabilities of LMMs. Based on GeoX-Bench, we evaluate 25 state-of-the-art LMMs on cross-view geo-localization and pose estimation tasks, and further explore the gains from instruction tuning. Our benchmark demonstrates that while current LMMs achieve impressive performance on geo-localization tasks, their effectiveness declines significantly on the more complex pose estimation tasks, highlighting a critical area for future improvement. Instruction-tuning LMMs on the training data of GeoX-Bench can significantly improve their cross-view geo-sense abilities.

Published

2026-03-14

How to Cite

Zheng, Y., Ying, J., Duan, H., Li, C., Zhang, Z., Liu, J., … Zhai, G. (2026). GeoX-Bench: Benchmarking Cross-View Geo-Localization and Pose Estimation Capabilities of Large Multimodal Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13485–13493. https://doi.org/10.1609/aaai.v40i16.38353

Section

AAAI Technical Track on Computer Vision XIII