AerialMind: Towards Referring Multi-Object Tracking in UAV Scenarios

Authors

  • Chenglizhao Chen, Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China); Shandong Key Laboratory of Intelligent Oil & Gas Industrial Software
  • Shaofeng Liang, Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China); Shandong Key Laboratory of Intelligent Oil & Gas Industrial Software
  • Runwei Guan, Thrust of Artificial Intelligence, The Hong Kong University of Science and Technology (Guangzhou)
  • Xiaolou Sun, Purple Mountain Laboratories
  • Haocheng Zhao, School of Advanced Technology, Xi'an Jiaotong-Liverpool University
  • Haiyun Jiang, School of Automation and Intelligent Sensing, Shanghai Jiao Tong University
  • Tao Huang, College of Science and Engineering, James Cook University
  • Henghui Ding, Institute of Big Data, College of Computer Science and Artificial Intelligence, Fudan University
  • Qing-Long Han, School of Engineering, Swinburne University of Technology, Melbourne

DOI:

https://doi.org/10.1609/aaai.v40i4.37270

Abstract

Referring Multi-Object Tracking (RMOT) aims to achieve precise object detection and tracking through natural language instructions, representing a fundamental capability for intelligent robotic systems. However, current RMOT research remains mostly confined to ground-level scenarios, which limits the ability of such systems to capture broad-scale scene context and to perform comprehensive tracking and path planning. In contrast, Unmanned Aerial Vehicles (UAVs) leverage their expansive aerial perspectives and superior maneuverability to enable wide-area surveillance. Moreover, UAVs have emerged as critical platforms for Embodied Intelligence, which has given rise to an unprecedented demand for intelligent aerial systems capable of natural language interaction. To this end, we introduce AerialMind, the first large-scale RMOT benchmark in UAV scenarios, which aims to bridge this research gap. To facilitate its construction, we develop an innovative semi-automated collaborative agent-based labeling assistant (COALA) framework that significantly reduces labor costs while maintaining annotation quality. Furthermore, we propose HawkEyeTrack (HETrack), a novel method that collaboratively enhances vision-language representation learning and improves the perception of UAV scenarios. Comprehensive experiments validate the challenging nature of our dataset and the effectiveness of our method.

Published

2026-03-14

How to Cite

Chen, C., Liang, S., Guan, R., Sun, X., Zhao, H., Jiang, H., … Han, Q.-L. (2026). AerialMind: Towards Referring Multi-Object Tracking in UAV Scenarios. Proceedings of the AAAI Conference on Artificial Intelligence, 40(4), 2805–2813. https://doi.org/10.1609/aaai.v40i4.37270

Section

AAAI Technical Track on Computer Vision I