Dual-Teacher Interactive Knowledge Distillation Network for Text-to-Visible & Infrared Person Retrieval

Authors

  • Chenglong Li School of Artificial Intelligence, Anhui University; Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University
  • Zhengyu Chen School of Artificial Intelligence, Anhui University; Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University
  • Yifei Deng School of Computer Science and Technology, Anhui University
  • Aihua Zheng School of Artificial Intelligence, Anhui University; Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University

DOI:

https://doi.org/10.1609/aaai.v40i8.37527

Abstract

Text-to-visible & infrared person retrieval aims to retrieve the corresponding visible (RGB) and thermal infrared (TIR) images given a text description. Existing methods perform semantic decoupling by aligning RGB and TIR features separately with different attributes, thereby facilitating alignment between the fused multimodal representation and the text. However, the insufficient representation ability of the TIR modality and the limited cross-view representation capabilities of the RGB and TIR modalities restrict retrieval accuracy and robustness. To address these issues, we propose a novel Dual-teacher Interactive Knowledge Distillation Network, called DIKDNet, for robust text-to-visible & infrared person retrieval. It performs interactive knowledge distillation between two modality-specific teachers with rich cross-view representation capabilities to enhance TIR representations, and collaborative knowledge distillation from both teachers to the corresponding students to enhance cross-modal cross-view representations. Specifically, to enhance the representation ability of the TIR backbone network while preserving modality-specific characteristics, we design an Interactive Knowledge Distillation Module (IKDM), which introduces a boundary-constrained distillation strategy between the RGB and TIR backbones to transfer the semantic features of the RGB backbone to the TIR one. To enhance the cross-modal cross-view representation capability, we design a Collaborative Knowledge Distillation Module (CKDM) to transfer the cross-modal similarity relations and the cross-view multimodal representations from the teacher networks to the student ones. Experimental results demonstrate that our method consistently achieves significant performance gains on both the RGBT-PEDES and RGBNT201-PEDES datasets. The code will be released upon acceptance.
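The abstract does not spell out the loss used inside the distillation modules. As an illustrative sketch only (not the paper's IKDM or CKDM, whose boundary-constrained and relation-transfer strategies are more involved), standard teacher-to-student knowledge distillation minimizes a temperature-scaled KL divergence between the teacher's and student's softened predictions; all function names below are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (numerically stable)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_kl(teacher_logits, student_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradients keep a comparable magnitude across T."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# A student that matches the teacher exactly incurs zero loss;
# any mismatch yields a positive penalty to minimize.
print(distillation_kl([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))  # 0.0
print(distillation_kl([2.0, 0.5, -1.0], [0.5, 2.0, -1.0]) > 0.0)  # True
```

In a dual-teacher setup such as the one described above, losses of this kind would be computed per teacher-student pair and summed with the retrieval objective; the weighting and the boundary constraints are specific to the paper.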

Published

2026-03-14

How to Cite

Li, C., Chen, Z., Deng, Y., & Zheng, A. (2026). Dual-Teacher Interactive Knowledge Distillation Network for Text-to-Visible & Infrared Person Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 40(8), 6037–6045. https://doi.org/10.1609/aaai.v40i8.37527

Section

AAAI Technical Track on Computer Vision V