CLIPDet3D: Vision-Language Collaborative Distillation for 3D Object Detection

Jiaqi Zhao; Huanfeng Hu; Yong Zhou; Wen-Liang Du; Kunyang Sun; Rui Yao; Qigong Sun

doi:10.1609/aaai.v40i16.38316

Authors

Jiaqi Zhao China University of Mining and Technology
Huanfeng Hu China University of Mining and Technology
Yong Zhou China University of Mining and Technology
Wen-Liang Du China University of Mining and Technology
Kunyang Sun China University of Mining and Technology
Rui Yao China University of Mining and Technology
Qigong Sun SenseTime

DOI:

https://doi.org/10.1609/aaai.v40i16.38316

Abstract

Multi-view 3D object detection plays a vital role in autonomous driving systems due to its ability to perceive complex scenes accurately. However, real-world driving data often exhibits a long-tailed distribution, causing significant drops in detection accuracy for rare categories in existing methods. To mitigate this issue, we propose CLIPDet3D, a novel vision-language collaborative framework for multi-view 3D object detection. First, to tackle the difficulty of capturing the semantic information of rare categories, a Vision-Language Collaborative Learning strategy is proposed to incorporate class-level semantic priors from CLIP. Second, a Depth Feature Contrastive Distillation module is designed to overcome the large depth estimation error for rare categories by aligning depth features between a teacher and a student network. Furthermore, to alleviate the difficulty in focusing on regions of rare categories, a Dual-Stream Prompt Attention mechanism is devised to inject learnable prompts and compute attention along both horizontal and vertical BEV directions. Evaluations on the nuScenes dataset demonstrate that CLIPDet3D achieves state-of-the-art accuracy while maintaining efficient inference.
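
The abstract does not specify the loss used for Depth Feature Contrastive Distillation. As a rough illustration of what aligning depth features between a teacher and a student contrastively could look like, the sketch below uses a generic InfoNCE-style objective, which is a common contrastive loss and is swapped in here as an assumption, not taken from the paper. The function names, the temperature value, and the choice of treating same-index teacher features as positives are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce_distill_loss(student_feats, teacher_feats, temperature=0.1):
    """Mean InfoNCE loss between paired student and teacher feature vectors.

    Assumption (not from the paper): student feature i is pulled toward
    teacher feature i (the positive) and pushed away from every other
    teacher feature in the batch (the negatives).
    """
    if not student_feats or len(student_feats) != len(teacher_feats):
        raise ValueError("student and teacher need the same non-zero number of features")

    students = [l2_normalize(v) for v in student_feats]
    teachers = [l2_normalize(v) for v in teacher_feats]

    total = 0.0
    for i, s_i in enumerate(students):
        # Cosine similarity to every teacher feature, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(s_i, t_j)) / temperature for t_j in teachers]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_denom = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
        # Negative log-probability of picking the matching teacher feature.
        total += log_denom - logits[i]
    return total / len(students)


# Toy depth features: two samples, two channels.
teacher = [[1.0, 0.0], [0.0, 1.0]]
aligned_student = [[2.0, 0.0], [0.0, 3.0]]     # same directions as the teacher
misaligned_student = [[0.0, 1.0], [1.0, 0.0]]  # directions swapped

aligned_loss = info_nce_distill_loss(aligned_student, teacher)
misaligned_loss = info_nce_distill_loss(misaligned_student, teacher)
```

In this toy case the aligned student yields a loss close to zero while the swapped student is penalized heavily, which is the behavior a distillation signal of this kind is meant to provide. A real detector would apply such a loss to dense depth feature maps inside a deep learning framework rather than to Python lists.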
