Learning Better UAV-Based Cross-View Object Geo-Localization from Multi-Modal Prompts: MoP-UAV Benchmark and MoPT Framework

Authors

  • Xiaohan Zhang Zhejiang University
  • Zhangkai Shen Zhejiang University
  • Si-Yuan Cao Zhejiang University
  • Xiaokai Bai Zhejiang University
  • Yiming Li Zhejiang University
  • Zheheng Han Zhejiang University
  • Zhe Wu Zhejiang University
  • Qi Ming Beijing University of Technology
  • Hui-Liang Shen Zhejiang University; Key Laboratory of Airspace Sensing and Autonomous Unmanned Systems of Zhejiang Province

DOI:

https://doi.org/10.1609/aaai.v40i15.38282

Abstract

We present MoP-UAV, a new benchmark for UAV-based cross-view object geo-localization guided by multi-modal prompts. MoP-UAV supports fine-grained, object-level cross-view localization under diverse prompt modalities, including natural language, bounding boxes, and click points. It offers potential for incorporating large foundation models such as large language models (LLMs) and promotes the building of more flexible and intelligent UAV agents. Based on the benchmark, we propose MoPT, a multi-modal-prompt-guided transformer that embeds prompts as token sequences and extracts the object location from UAV and satellite features via cross-attention. To enhance semantic consistency and performance, we further adopt a cross-view contrastive loss and propose a RefCOCOg-based pre-training strategy. Extensive experiments show that MoPT achieves robust localization under arbitrary prompt combinations. Notably, multi-modal-prompt training significantly boosts unimodal-prompt inference performance, highlighting the generalization benefits of multi-modal learning. MoPT trained with multi-modal prompts outperforms prior unimodal-prompt methods under the same setting.
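
To make the cross-attention step concrete, the following is a minimal illustrative sketch of single-head scaled dot-product cross-attention, in which prompt tokens act as queries over image feature tokens. It is not the authors' implementation: the function names, the toy 2-dimensional tokens, and the absence of learned projections, multiple heads, and a localization head are all simplifying assumptions made for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned projections).

    queries: prompt tokens, a list of n_q vectors of dimension d
    keys, values: image feature tokens, lists of n_k vectors
    Returns (outputs, weights); weights[i][j] is how strongly
    prompt token i attends to image token j.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights.append(softmax(scores))
    dim_v = len(values[0])
    outputs = [
        [sum(w * v[j] for w, v in zip(w_row, values)) for j in range(dim_v)]
        for w_row in weights
    ]
    return outputs, weights


# Hypothetical toy data: two prompt tokens and three image feature tokens.
prompt_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_tokens = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]

attended, attn = cross_attention(prompt_tokens, image_tokens, image_tokens)
```

In this toy setup each prompt token places most of its attention on the image token it aligns with, which is the intuition behind using prompt tokens to pick out object-relevant regions in the UAV and satellite features.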

Published

2026-03-14

How to Cite

Zhang, X., Shen, Z., Cao, S.-Y., Bai, X., Li, Y., Han, Z., … Shen, H.-L. (2026). Learning Better UAV-Based Cross-View Object Geo-Localization from Multi-Modal Prompts: MoP-UAV Benchmark and MoPT Framework. Proceedings of the AAAI Conference on Artificial Intelligence, 40(15), 12843–12851. https://doi.org/10.1609/aaai.v40i15.38282

Section

AAAI Technical Track on Computer Vision XII