Multi-view Invariance Learning for 3D Scene Graph Pre-training via Collaborative Cross-Modal Regularization
DOI:
https://doi.org/10.1609/aaai.v40i7.37435
Abstract
3D scene graph generation is a pivotal task in scene understanding, but its performance is easily constrained by the limited availability of annotated data. Existing point cloud pre-training methods usually emphasize object-centric representations while neglecting predicate feature learning. This limitation significantly hinders their relational reasoning capabilities, as inter-object relationships are fundamentally governed by predicate features. To enhance 3D scene graph pre-training, this paper proposes a task-specific Multi-view Invariance Learning framework with Collaborative Cross-modal Regularization. Specifically, the inherent horizontal-rotation invariance of 3D objects and their semantic relationships is leveraged to construct a self-supervised paradigm for triplet feature learning. Moreover, our framework harnesses cross-modal prior knowledge from a vision-language model to regularize model optimization, and further achieves semantic discrimination via unsupervised deep clustering. To resolve the knowledge discrepancies arising from the pre-trained model during fine-tuning, a predicate adapter equipped with a knowledge filtering gate is devised to selectively aggregate the predicate features of the pre-trained model. Extensive experiments demonstrate that our framework is effective in boosting 3D scene graph generation performance, surpassing state-of-the-art methods.
Published
2026-03-14
How to Cite
Huang, Y., Ji, L., Xiao, R., & Sun, J. (2026). Multi-view Invariance Learning for 3D Scene Graph Pre-training via Collaborative Cross-Modal Regularization. Proceedings of the AAAI Conference on Artificial Intelligence, 40(7), 5203–5211. https://doi.org/10.1609/aaai.v40i7.37435
Issue
Section
AAAI Technical Track on Computer Vision IV