Adaptive-Smooth LiDAR-Camera Knowledge Distillation with Heterogeneous Fusion for Multi-View 3D Object Detection

Authors

  • Rui Zhao, Shenzhen University
  • Shuoyao Wang, Shenzhen University
  • Xinhu Zheng, The Hong Kong University of Science and Technology, Guangzhou
  • Shijian Gao, The Hong Kong University of Science and Technology, Guangzhou; Guangdong Provincial Key Laboratory of Future Networks of Intelligence, The Chinese University of Hong Kong, Shenzhen

DOI:

https://doi.org/10.1609/aaai.v40i16.38323

Abstract

Multi-view 3D object detection has garnered increasing attention, particularly due to its success in autonomous driving systems. Although multi-view systems capture rich semantic information, their spatial-geometric reasoning capabilities remain limited. Recent studies employ simulated point cloud generation mechanisms to facilitate LiDAR-camera multi-modal knowledge distillation, achieving formal structural consistency. Despite these advances, such methods still face two main issues: i) alignment challenges caused by discrepancies between LiDAR and camera data, and ii) prediction errors in the simulated point clouds that compromise the semantic information extracted from images during fusion. To address these problems, we propose adaptive-smooth distillation, which optimizes alignment granularity based on feature discrepancies for improved LiDAR-camera knowledge distillation. Specifically, this work considers both LiDAR-to-camera cross-modal distillation and multi-modal distillation from LiDAR-camera fusion to simulated point cloud-camera fusion. We then introduce a heterogeneous fusion module that strategically biases the fusion process toward the extracted camera features, thereby enhancing the robustness of the fused feature. Additionally, soft-weighted response distillation is proposed to help the student model selectively mimic the high-quality outputs of the teacher model. Extensive experiments demonstrate the superiority of our method, which achieves statistically significant improvements of 4.9% in mean Average Precision (mAP) and 4.5% in NuScenes Detection Score (NDS) over the benchmark.

Published

2026-03-14

How to Cite

Zhao, R., Wang, S., Zheng, X., & Gao, S. (2026). Adaptive-Smooth LiDAR-Camera Knowledge Distillation with Heterogeneous Fusion for Multi-View 3D Object Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13217–13225. https://doi.org/10.1609/aaai.v40i16.38323

Section

AAAI Technical Track on Computer Vision XIII