GeoX-Bench: Benchmarking Cross-View Geo-Localization and Pose Estimation Capabilities of Large Multimodal Models

Yushuo Zheng; Jiangyong Ying; Huiyu Duan; Chunyi Li; Zicheng Zhang; Jing Liu; Xiaohong Liu; Guangtao Zhai

doi:10.1609/aaai.v40i16.38353

Authors

Yushuo Zheng, Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory
Jiangyong Ying, China Telecom
Huiyu Duan, Shanghai Jiao Tong University
Chunyi Li, Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory
Zicheng Zhang, Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory
Jing Liu, Tianjin University
Xiaohong Liu, Shanghai Jiao Tong University; Shanghai Jiao Tong University Sichuan Research Institute
Guangtao Zhai, Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory


Abstract

Large multimodal models (LMMs) have demonstrated remarkable capabilities across a wide range of tasks; however, their knowledge and abilities in cross-view geo-localization and pose estimation remain unexplored, despite potential benefits for applications such as navigation, autonomous driving, and outdoor robotics. To bridge this gap, we introduce GeoX-Bench, a comprehensive benchmark designed to explore and evaluate the capabilities of LMMs in cross-view geo-localization and pose estimation. Specifically, GeoX-Bench contains 10,859 panoramic-satellite image pairs spanning 128 cities in 49 countries, along with 755,976 corresponding question-answering (QA) pairs. Among these, 42,900 QA pairs are designated for benchmarking, while the remainder are intended for enhancing the capabilities of LMMs. Based on GeoX-Bench, we evaluate 25 state-of-the-art LMMs on cross-view geo-localization and pose estimation tasks, and further explore how instruction-tuning improves these capabilities. Our benchmark results demonstrate that while current LMMs achieve impressive performance on geo-localization tasks, their effectiveness declines significantly on the more complex pose estimation tasks, highlighting a critical area for future improvement. Instruction-tuning LMMs on the training data of GeoX-Bench significantly improves their cross-view geo-sense abilities.
