Learning Better UAV-Based Cross-View Object Geo-Localization from Multi-Modal Prompts: MoP-UAV Benchmark and MoPT Framework

Xiaohan Zhang; Zhangkai Shen; Si-Yuan Cao; Xiaokai Bai; Yiming Li; Zheheng Han; Zhe Wu; Qi Ming; Hui-Liang Shen

doi:10.1609/aaai.v40i15.38282

Authors

Xiaohan Zhang, Zhejiang University
Zhangkai Shen, Zhejiang University
Si-Yuan Cao, Zhejiang University
Xiaokai Bai, Zhejiang University
Yiming Li, Zhejiang University
Zheheng Han, Zhejiang University
Zhe Wu, Zhejiang University
Qi Ming, Beijing University of Technology
Hui-Liang Shen, Zhejiang University Key Laboratory of Airspace Sensing and Autonomous Unmanned Systems of Zhejiang Province

DOI:

https://doi.org/10.1609/aaai.v40i15.38282

Abstract

We present MoP-UAV, a new benchmark for UAV-based cross-view object geo-localization guided by multi-modal prompts. MoP-UAV supports fine-grained object-level cross-view localization under diverse prompt modalities, including natural language, bounding boxes, and click points. It offers potential for incorporating large foundation models such as large language models (LLMs) and promotes the building of more flexible and intelligent UAV agents. Based on the benchmark, we propose MoPT, a multi-modal-prompt-guided transformer that embeds prompts as token sequences and extracts the object location from UAV and satellite features via cross-attention. To enhance semantic consistency and performance, we further adopt a cross-view contrastive loss and propose a RefCOCOg-based pre-training strategy. Extensive experiments show that MoPT achieves robust localization under arbitrary prompt combinations. Notably, multi-modal-prompt training significantly boosts unimodal-prompt inference performance, highlighting the generalization benefits of multi-modal learning. MoPT trained with multi-modal prompts outperforms prior unimodal-prompt works under the same setting.
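The abstract describes prompts embedded as token sequences that query image features via cross-attention. As a rough illustration of that general mechanism only, the sketch below implements plain single-head scaled dot-product cross-attention in pure Python. It is not the authors' MoPT implementation: the function names, the toy vectors, and the absence of learned projections and multiple heads are all assumptions made for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (illustrative only).

    queries: prompt tokens, a list of d-dimensional vectors (hypothetical).
    keys, values: image feature tokens, lists of vectors (hypothetical).
    Returns the attended vectors and the attention weights, one row per query.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this prompt token to every image token, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(wj * v[i] for wj, v in zip(w, values)) for i in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy data: two prompt tokens attend over three image tokens.
prompt_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_tokens = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
attended, attn = cross_attention(prompt_tokens, image_tokens, image_tokens)
```

In this toy setup each prompt token places most of its attention on the image token it aligns with, which is the behavior a prompt-guided localizer would rely on when picking out the prompted object's features.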
