Semi-supervised 3D Semantic Scene Completion with 2D Vision Foundation Model Guidance
DOI:
https://doi.org/10.1609/aaai.v39i6.32698
Abstract
Accurate prediction of 3D semantic occupancy from 2D visual images is vital in enabling autonomous agents to comprehend their surroundings for planning and navigation. State-of-the-art methods typically employ fully supervised approaches, necessitating a huge labeled dataset acquired through expensive LiDAR sensors and meticulous voxel-wise labeling by human annotators. The resource-intensive nature of this annotating process significantly hampers the application and scalability of these methods. We introduce a novel semi-supervised framework to alleviate the dependency on densely annotated data. Our approach leverages 2D foundation models to generate essential 3D scene geometric and semantic cues, facilitating a more efficient training process. Our framework exhibits notable properties: (1) Generalizability, applicable to various 3D semantic scene completion approaches, including 2D-3D lifting and 3D-2D transformer methods. (2) Effectiveness, as demonstrated through experiments on SemanticKITTI and NYUv2, wherein our method achieves up to 85% of the fully-supervised performance using only 10% labeled data. This approach not only reduces the cost and labor associated with data annotation but also demonstrates the potential for broader adoption in camera-based systems for 3D semantic occupancy prediction.
Published
2025-04-11
How to Cite
Pham, D.-H., Nguyen, D.-D., Pham, A., Ho, T., Nguyen, P., Nguyen, K., & Nguyen, R. (2025). Semi-supervised 3D Semantic Scene Completion with 2D Vision Foundation Model Guidance. Proceedings of the AAAI Conference on Artificial Intelligence, 39(6), 6514-6522. https://doi.org/10.1609/aaai.v39i6.32698
Section
AAAI Technical Track on Computer Vision V