3D Annotation-Free Learning by Distilling 2D Open-Vocabulary Segmentation Models for Autonomous Driving

Authors

  • Boyi Sun, Institute of Automation, Chinese Academy of Science; Zhongke JingYu Sensing Technology Co., Ltd
  • Yuhang Liu, Institute of Automation, Chinese Academy of Science; Zhongke JingYu Sensing Technology Co., Ltd
  • Xingxia Wang, Institute of Automation, Chinese Academy of Science
  • Bin Tian, Institute of Automation, Chinese Academy of Science; Waytous
  • Long Chen, Institute of Automation, Chinese Academy of Science; Waytous
  • Fei-Yue Wang, Institute of Automation, Chinese Academy of Science

DOI:

https://doi.org/10.1609/aaai.v39i7.32760

Abstract

Point cloud labeling is a time-consuming and expensive task in autonomous driving, whereas annotation-free learning avoids it by learning point cloud representations from unannotated data. In this paper, we propose AFOV, a novel 3D Annotation-Free framework assisted by 2D Open-Vocabulary segmentation models. It consists of two stages. In the first stage, we integrate high-quality textual and image features from 2D open-vocabulary models and propose Tri-Modal contrastive Pre-training (TMP). In the second stage, the spatial mapping between point clouds and images is used to generate pseudo-labels, enabling cross-modal knowledge distillation. In addition, we introduce Approximate Flat Interaction (AFI) to address noise during alignment and label confusion. To validate AFOV, we conduct extensive experiments on multiple related datasets. AFOV achieves 47.73% mIoU on the annotation-free 3D segmentation task on nuScenes, surpassing the previous best model by 3.13% mIoU. When fine-tuned with 1% of the data, it reaches 51.75% mIoU on nuScenes and 48.14% mIoU on SemanticKITTI, outperforming all previous pre-trained models.
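
The second stage described above relies on a spatial mapping between point clouds and images to produce pseudo-labels. The following is a minimal sketch of that general idea, not AFOV's actual implementation: each LiDAR point is projected into a camera image with a pinhole model and takes the class of the pixel it lands on in a 2D segmentation mask. The function name, the parameters `K` (3x3 camera intrinsics) and `T_cam_from_lidar` (4x4 extrinsic transform), and the `ignore_label` value are all hypothetical and assumed here for illustration.

```python
import numpy as np


def project_points_to_pseudo_labels(points_lidar, seg_mask, K, T_cam_from_lidar,
                                    ignore_label=-1):
    """Give each LiDAR point the 2D label of the pixel it projects onto.

    points_lidar:      (N, 3+) array, xyz in the LiDAR frame (assumed layout).
    seg_mask:          (H, W) integer class map from a 2D segmentation model.
    K:                 (3, 3) pinhole intrinsics (assumed, no distortion).
    T_cam_from_lidar:  (4, 4) rigid transform from LiDAR to camera frame.
    Points behind the camera or outside the image receive ignore_label.
    """
    n = points_lidar.shape[0]
    h, w = seg_mask.shape

    # Move points into the camera frame using homogeneous coordinates.
    pts_h = np.hstack([points_lidar[:, :3], np.ones((n, 1))])
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]

    labels = np.full(n, ignore_label, dtype=np.int64)

    # Only points with positive depth can be seen by the camera.
    in_front = pts_cam[:, 2] > 1e-6
    uvw = (K @ pts_cam[in_front].T).T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(np.int64)  # pixel column
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(np.int64)  # pixel row

    # Keep projections that fall inside the image bounds.
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    idx = np.flatnonzero(in_front)[inside]
    labels[idx] = seg_mask[v[inside], u[inside]]
    return labels
```

Labels transferred this way are noisy near object boundaries and under calibration error, which is the kind of alignment noise the abstract says AFI is introduced to address.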

Published

2025-04-11

How to Cite

Sun, B., Liu, Y., Wang, X., Tian, B., Chen, L., & Wang, F.-Y. (2025). 3D Annotation-Free Learning by Distilling 2D Open-Vocabulary Segmentation Models for Autonomous Driving. Proceedings of the AAAI Conference on Artificial Intelligence, 39(7), 7078–7086. https://doi.org/10.1609/aaai.v39i7.32760

Issue

Section

AAAI Technical Track on Computer Vision VI