Multi-view Invariance Learning for 3D Scene Graph Pre-training via Collaborative Cross-Modal Regularization

Authors

  • Yucheng Huang, University of Electronic Science and Technology of China
  • Luping Ji, University of Electronic Science and Technology of China
  • Ruijie Xiao, University of Electronic Science and Technology of China
  • Jiayuan Sun, University of Electronic Science and Technology of China

DOI:

https://doi.org/10.1609/aaai.v40i7.37435

Abstract

3D scene graph generation is a pivotal task in scene understanding, but its performance is easily constrained by the limited availability of annotated data. Existing point cloud pre-training solutions usually emphasize object-centric representations while neglecting predicate feature learning. This limitation significantly hinders their relational reasoning capabilities, as inter-object relationships are fundamentally governed by predicate features. To enhance 3D scene graph pre-training, this paper proposes a task-specific Multi-view Invariance Learning framework with Collaborative Cross-modal Regularization. Specifically, the inherent horizontal-rotation invariance of 3D objects and their semantic relationships is leveraged to construct a self-supervised paradigm for triplet feature learning. Moreover, the framework harnesses cross-modal prior knowledge from a vision-language model to regularize model optimization, and further achieves semantic discrimination via unsupervised deep clustering. To resolve the knowledge discrepancies arising from the pre-trained model during fine-tuning, a predicate adapter equipped with a knowledge filtering gate is devised to selectively aggregate the predicate features of the pre-trained model. Extensive experiments demonstrate that the framework is effective in boosting 3D scene graph generation performance, surpassing state-of-the-art methods.
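The horizontal-rotation invariance mentioned in the abstract can be illustrated with a small geometric sketch. This is not the authors' implementation: it assumes the vertical axis is z, and the names `rotate_horizontal`, `centroid`, and `pair_descriptor`, as well as the toy "chair" and "table" point sets, are hypothetical. It only shows that a simple pairwise descriptor of two objects is unchanged when the whole scene is rotated about the vertical axis, which is the kind of property a multi-view self-supervised objective could exploit.

```python
import math

def rotate_horizontal(points, angle_rad):
    """Rotate (x, y, z) points about the vertical z-axis (assumed up-axis)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

def centroid(points):
    """Mean position of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def pair_descriptor(obj_a, obj_b):
    """Toy descriptor for an object pair (hypothetical, for illustration only):
    horizontal distance and vertical offset between the two centroids."""
    ca, cb = centroid(obj_a), centroid(obj_b)
    horizontal = math.hypot(cb[0] - ca[0], cb[1] - ca[1])
    vertical = cb[2] - ca[2]
    return horizontal, vertical

# Two made-up objects, each given as a handful of points.
chair = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.5, 0.5, 1.0), (0.0, 0.5, 1.0)]
table = [(2.0, 1.0, 0.0), (3.0, 1.0, 0.0), (3.0, 2.0, 0.8), (2.0, 2.0, 0.8)]

original = pair_descriptor(chair, table)

# Build several rotated "views" of the same scene and recompute the descriptor.
views = []
for degrees in (90, 180, 270):
    angle = math.radians(degrees)
    views.append(
        pair_descriptor(rotate_horizontal(chair, angle),
                        rotate_horizontal(table, angle))
    )

# Every view should match the original descriptor up to floating-point error.
for horizontal, vertical in views:
    assert math.isclose(horizontal, original[0], abs_tol=1e-9)
    assert math.isclose(vertical, original[1], abs_tol=1e-9)
```

In a learned setting, the hand-written descriptor would be replaced by encoder outputs for objects and predicates, and the equality check by a loss that pulls the features of rotated views together; those details are not specified in the abstract.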

Published

2026-03-14

How to Cite

Huang, Y., Ji, L., Xiao, R., & Sun, J. (2026). Multi-view Invariance Learning for 3D Scene Graph Pre-training via Collaborative Cross-Modal Regularization. Proceedings of the AAAI Conference on Artificial Intelligence, 40(7), 5203–5211. https://doi.org/10.1609/aaai.v40i7.37435

Section

AAAI Technical Track on Computer Vision IV