Text-based Aerial-Ground Person Retrieval
DOI:
https://doi.org/10.1609/aaai.v40i34.40140
Abstract
This work introduces Text-based Aerial-Ground Person Retrieval (TAG-PR), which aims to retrieve person images from heterogeneous aerial and ground views using textual descriptions. Unlike traditional Text-based Person Retrieval (T-PR), which focuses solely on ground-view images, TAG-PR is of greater practical significance and presents unique challenges due to the large viewpoint discrepancy across images. To support this task, we contribute: (1) the TAG-PEDES dataset, constructed from public benchmarks with automatically generated textual descriptions and enhanced by a diversified text generation paradigm to ensure robustness under view heterogeneity; and (2) TAG-CLIP, a novel retrieval framework that addresses view heterogeneity through a hierarchically-routed mixture-of-experts module that learns view-specific and view-agnostic features, and a viewpoint decoupling strategy that decouples view-specific features for better cross-modal alignment. We evaluate the effectiveness of TAG-CLIP on both the proposed TAG-PEDES and existing T-PR benchmarks.
Published
2026-03-14
How to Cite
Zhou, X., Wu, Y., Ma, J., Wang, W., Cao, M., & Ye, M. (2026). Text-based Aerial-Ground Person Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 40(34), 29035-29043. https://doi.org/10.1609/aaai.v40i34.40140
Section
AAAI Technical Track on Machine Learning XI