Dual-Teacher Interactive Knowledge Distillation Network for Text-to-Visible & Infrared Person Retrieval

Authors

  • Chenglong Li School of Artificial Intelligence, Anhui University; Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University
  • Zhengyu Chen School of Artificial Intelligence, Anhui University; Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University
  • Yifei Deng School of Computer Science and Technology, Anhui University
  • Aihua Zheng School of Artificial Intelligence, Anhui University; Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University

DOI:

https://doi.org/10.1609/aaai.v40i8.37527

Abstract

Text-to-visible & infrared person retrieval aims to retrieve the corresponding visible (RGB) and thermal infrared (TIR) images given a text description. Existing methods perform semantic decoupling by aligning RGB and TIR features separately with different attributes, thereby facilitating alignment between the fused multimodal representation and the text. However, the insufficient representation ability of the TIR modality and the limited cross-view representation capabilities of the RGB and TIR modalities restrict retrieval accuracy and robustness. To address these issues, we propose a novel Dual-teacher Interactive Knowledge Distillation Network, called DIKDNet, for robust text-to-visible & infrared person retrieval. It performs interactive knowledge distillation between two modality-specific teachers with rich cross-view representation capabilities to enhance TIR representations, and collaborative knowledge distillation from both teachers to the corresponding students to enhance cross-modal cross-view representations. Specifically, to enhance the representation ability of the TIR backbone network while preserving modality-specific characteristics, we design an Interactive Knowledge Distillation Module (IKDM), which introduces a boundary-constrained distillation strategy between the RGB and TIR backbones to transfer the semantic features of the RGB backbone to the TIR one. To enhance the cross-modal cross-view representation capability, we design a Collaborative Knowledge Distillation Module (CKDM) to transfer the cross-modal similarity relations and the cross-view multimodal representations from the teacher networks to the student ones. Experimental results demonstrate that our method consistently achieves significant performance gains on both the RGBT-PEDES and RGBNT201-PEDES datasets. The code will be released upon acceptance.
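The abstract does not spell out the loss used inside the distillation modules. As an illustrative sketch only (not the paper's IKDM or CKDM, whose boundary-constrained and relation-transfer strategies are more involved), standard teacher-to-student knowledge distillation minimizes a temperature-scaled KL divergence between the teacher's and student's softened predictions; all function names below are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (numerically stable)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_kl(teacher_logits, student_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradients keep a comparable magnitude across T."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# A student that matches the teacher exactly incurs zero loss;
# any mismatch yields a positive penalty to minimize.
print(distillation_kl([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))  # 0.0
print(distillation_kl([2.0, 0.5, -1.0], [0.5, 2.0, -1.0]) > 0.0)  # True
```

In a dual-teacher setup such as the one described above, losses of this kind would be computed per teacher-student pair and summed with the retrieval objective; the weighting and the boundary constraints are specific to the paper.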

Published

2026-03-14

How to Cite

Li, C., Chen, Z., Deng, Y., & Zheng, A. (2026). Dual-Teacher Interactive Knowledge Distillation Network for Text-to-Visible & Infrared Person Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 40(8), 6037–6045. https://doi.org/10.1609/aaai.v40i8.37527

Section

AAAI Technical Track on Computer Vision V