Text-based Aerial-Ground Person Retrieval

Xinyu Zhou; Yu Wu; Jiayao Ma; Wenhao Wang; Min Cao; Mang Ye

doi:10.1609/aaai.v40i34.40140

Authors

Xinyu Zhou, School of Computer Science and Technology, Soochow University
Yu Wu, School of Computer Science and Technology, Soochow University
Jiayao Ma, AgiBot
Wenhao Wang, AgiBot
Min Cao, School of Computer Science and Technology, Soochow University
Mang Ye, School of Computer Science, Wuhan University

Abstract

This work introduces Text-based Aerial-Ground Person Retrieval (TAG-PR), which aims to retrieve person images from heterogeneous aerial and ground views using textual descriptions. Unlike traditional Text-based Person Retrieval (T-PR), which focuses solely on ground-view images, TAG-PR is of greater practical significance and presents unique challenges due to the large viewpoint discrepancy across images. To support this task, we contribute: (1) the TAG-PEDES dataset, constructed from public benchmarks with automatically generated textual descriptions and enhanced by a diversified text generation paradigm to ensure robustness under view heterogeneity; and (2) TAG-CLIP, a novel retrieval framework that addresses view heterogeneity through a hierarchically-routed mixture-of-experts module that learns view-specific and view-agnostic features, together with a viewpoint decoupling strategy that decouples view-specific features for better cross-modal alignment. We evaluate the effectiveness of TAG-CLIP on both the proposed TAG-PEDES and existing T-PR benchmarks.
