Adaptive-Smooth LiDAR-Camera Knowledge Distillation with Heterogeneous Fusion for Multi-View 3D Object Detection
DOI:
https://doi.org/10.1609/aaai.v40i16.38323Abstract
Multi-view 3D object detection has garnered increasing attention, particularly due to its success in autonomous driving systems. Although multi-view systems possess rich semantic information, their spatial-geometric reasoning capabilities remain limited. Recent studies employ simulated point cloud generation mechanisms to facilitate LiDAR-camera multi-modal knowledge distillation, achieving formal structural consistency. Despite advancements, these methods still face two main issues: i) alignment challenges caused by discrepancies between LiDAR and camera data, and ii) prediction errors from simulated point clouds that compromise the semantic information extracted from images during fusion. To address these problems, we propose adaptive-smooth distillation to optimize alignment granularity based on feature discrepancies for improved LiDAR-camera knowledge distillation. Specifically, this work considers both LIDAR-to-camera cross-modal distillation and LiDAR-camera fusion to simulated point cloud-camera fusion multi-modal distillation. Then, we introduce a heterogeneous fusion module to strategically bias the fusion process toward the extracted camera features, thereby enhancing the robustness of the fusion feature. Additionally, soft-weighted response distillation is proposed to facilitate the student model to selectively mimic the high-quality output of the teacher model. Extensive experiments have demonstrated the superiority of our method, achieving statistically significant improvements of 4.9% in mean Average Precision (mAP) and 4.5% in NuScenes Detection Score (NDS) over the benchmark.Downloads
Published
2026-03-14
How to Cite
Zhao, R., Wang, S., Zheng, X., & Gao, S. (2026). Adaptive-Smooth LiDAR-Camera Knowledge Distillation with Heterogeneous Fusion for Multi-View 3D Object Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13217–13225. https://doi.org/10.1609/aaai.v40i16.38323
Issue
Section
AAAI Technical Track on Computer Vision XIII