CLIPDet3D: Vision-Language Collaborative Distillation for 3D Object Detection

Authors

  • Jiaqi Zhao China University of Mining and Technology
  • Huanfeng Hu China University of Mining and Technology
  • Yong Zhou China University of Mining and Technology
  • Wen-Liang Du China University of Mining and Technology
  • Kunyang Sun China University of Mining and Technology
  • Rui Yao China University of Mining and Technology
  • Qigong Sun SenseTime

DOI:

https://doi.org/10.1609/aaai.v40i16.38316

Abstract

Multi-view 3D object detection plays a vital role in autonomous driving systems due to its ability to perceive complex scenes accurately. However, real-world driving data often exhibits a long-tailed distribution, causing significant drops in detection accuracy for rare categories in existing methods. To mitigate this issue, we propose CLIPDet3D, a novel vision-language collaborative framework for multi-view 3D object detection. First, to tackle the difficulty of capturing the semantic information of rare categories, a Vision-Language Collaborative Learning strategy is proposed to incorporate class-level semantic priors from CLIP. Second, a Depth Feature Contrastive Distillation module is designed to overcome the large depth estimation error for rare categories by aligning depth features between a teacher and a student network. Furthermore, to alleviate the difficulty in focusing on regions of rare categories, a Dual-Stream Prompt Attention mechanism is devised to inject learnable prompts and compute attention along both horizontal and vertical BEV directions. Evaluations on the nuScenes dataset demonstrate that CLIPDet3D achieves state-of-the-art accuracy while maintaining efficient inference.
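
The abstract does not give the loss used by the Depth Feature Contrastive Distillation module. As a rough illustration of the general idea of aligning student depth features with teacher depth features contrastively, the following is a minimal sketch that assumes an InfoNCE-style objective: each student feature is pulled toward the teacher feature at the same index and pushed away from the others. The function names, the `temperature` value, and the pairing by index are all assumptions made for this example and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def contrastive_distill_loss(student_feats, teacher_feats, temperature=0.1):
    """Hypothetical InfoNCE-style distillation loss (an assumption, not the paper's formula).

    student_feats[i] and teacher_feats[i] are treated as a positive pair;
    every other teacher feature serves as a negative for student i.
    """
    s = [l2_normalize(v) for v in student_feats]
    t = [l2_normalize(v) for v in teacher_feats]
    total = 0.0
    for i, s_i in enumerate(s):
        # Cosine similarities between student i and every teacher feature.
        logits = [dot(s_i, t_j) / temperature for t_j in t]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching teacher feature as the target.
        total += log_denom - logits[i]
    return total / len(s)


teacher = [[1.0, 0.0], [0.0, 1.0]]
aligned_student = [[2.0, 0.0], [0.0, 3.0]]   # same directions as the teacher
swapped_student = [[0.0, 1.0], [1.0, 0.0]]   # directions mismatched

loss_aligned = contrastive_distill_loss(aligned_student, teacher)
loss_swapped = contrastive_distill_loss(swapped_student, teacher)
```

In this toy setup the aligned student yields a much smaller loss than the mismatched one, which is the behavior a contrastive alignment objective is meant to encourage.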

Published

2026-03-14

How to Cite

Zhao, J., Hu, H., Zhou, Y., Du, W.-L., Sun, K., Yao, R., & Sun, Q. (2026). CLIPDet3D: Vision-Language Collaborative Distillation for 3D Object Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13154–13162. https://doi.org/10.1609/aaai.v40i16.38316

Section

AAAI Technical Track on Computer Vision XIII