H2GFormer: Horizontal-to-Global Voxel Transformer for 3D Semantic Scene Completion

Authors

  • Yu Wang, Sino-French Engineer School, Beihang University
  • Chao Tong, School of Computer Science and Engineering, Beihang University; State Key Laboratory of Virtual Reality Technology and Systems, Beihang University

DOI:

https://doi.org/10.1609/aaai.v38i6.28384

Keywords:

CV: 3D Computer Vision, CV: Scene Analysis & Understanding

Abstract

3D Semantic Scene Completion (SSC) has emerged as a novel task in vision-based holistic 3D scene understanding. Its objective is to densely predict the occupancy and category of each voxel in a 3D scene based on input from either LiDAR or images. Currently, many transformer-based semantic scene completion frameworks employ simple yet popular Cross-Attention and Self-Attention mechanisms to integrate and infer dense geometric and semantic information of voxels. However, they overlook the distinctions among voxels in the scene, especially in outdoor scenarios where the horizontal direction contains more variations, and where voxels at object boundaries and voxels in object interiors carry different levels of positional significance. To address this issue, we propose a transformer-based SSC framework called H2GFormer that incorporates a horizontal-to-global approach. This framework takes into full consideration the variations of voxels in the horizontal direction and the characteristics of voxels on object boundaries. We introduce a horizontal window-to-global attention (W2G) module that effectively fuses semantic information by first diffusing it horizontally from reliably visible voxels and then propagating the semantic understanding to global voxels, ensuring a more reliable fusion of semantic-aware features. Moreover, an Internal-External Position Awareness Loss (IoE-PALoss) is utilized during network training to emphasize the critical positions within the transition regions between objects. The experiments conducted on the SemanticKITTI dataset demonstrate that H2GFormer exhibits superior performance in both geometric and semantic completion tasks. Our code is available at https://github.com/Ryanwy1/H2GFormer.
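The two-stage horizontal-to-global flow described above can be illustrated with a minimal sketch: attention is first applied within each fixed-height horizontal slice of a voxel feature grid, and the result is then attended over globally. This is an illustrative assumption of the idea only, using a single unlearned attention head; the function names (`horizontal_to_global`, `attention`) are hypothetical, and the paper's actual W2G module uses learned projections and windowed attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # plain scaled dot-product attention (no learned projections)
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def horizontal_to_global(voxels):
    """voxels: (X, Y, Z, C) feature grid.
    Stage 1: diffuse features within each horizontal (fixed-height) slice.
    Stage 2: propagate the result to all voxels with global attention.
    """
    X, Y, Z, C = voxels.shape
    out = np.empty_like(voxels)
    # Stage 1: per-height horizontal diffusion
    for z in range(Z):
        slab = voxels[:, :, z, :].reshape(X * Y, C)
        out[:, :, z, :] = attention(slab, slab, slab).reshape(X, Y, C)
    # Stage 2: global propagation over every voxel
    flat = out.reshape(X * Y * Z, C)
    return attention(flat, flat, flat).reshape(X, Y, Z, C)

feats = np.random.default_rng(0).normal(size=(4, 4, 3, 8))
fused = horizontal_to_global(feats)  # same (X, Y, Z, C) shape as the input
```

The ordering is the point: horizontal slices are where outdoor scenes vary most and where projected image features are most reliable, so semantics are stabilized there before being shared with occluded or distant voxels globally.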

Published

2024-03-24

How to Cite

Wang, Y., & Tong, C. (2024). H2GFormer: Horizontal-to-Global Voxel Transformer for 3D Semantic Scene Completion. Proceedings of the AAAI Conference on Artificial Intelligence, 38(6), 5722-5730. https://doi.org/10.1609/aaai.v38i6.28384

Section

AAAI Technical Track on Computer Vision V