SGFormer: Semantic Graph Transformer for Point Cloud-Based 3D Scene Graph Generation

Authors

  • Changsheng Lv, Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications
  • Mengshi Qi, Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications
  • Xia Li, Beijing University of Posts and Telecommunications
  • Zhengyuan Yang, University of Rochester
  • Huadong Ma, Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications

DOI:

https://doi.org/10.1609/aaai.v38i5.28197

Keywords:

CV: 3D Computer Vision, CV: Scene Analysis & Understanding, CV: Language and Vision, CV: Visual Reasoning & Symbolic Representations

Abstract

In this paper, we propose SGFormer (Semantic Graph TransFormer), a novel model for point cloud-based 3D scene graph generation. The task aims to parse a point cloud-based scene into a semantic structural graph, with the core challenge of modeling the complex global structure. Existing methods based on graph convolutional networks (GCNs) suffer from over-smoothing and can only propagate information from a limited set of neighboring nodes. In contrast, SGFormer uses Transformer layers as the base building block to allow global information passing, with two types of newly designed layers tailored to the 3D scene graph generation task. Specifically, we introduce the graph embedding layer to make the best use of the global information in graph edges while keeping computation costs comparable. Furthermore, we propose the semantic injection layer, which leverages linguistic knowledge from a large-scale language model (i.e., ChatGPT) to enhance objects' visual features. We benchmark SGFormer on the established 3DSSG dataset and achieve a 40.94% absolute improvement in R@50 for relationship prediction, and an 88.36% boost on the subset with complex scenes, over the state of the art. Our analyses further show SGFormer's superiority in long-tail and zero-shot scenarios. Our source code is available at https://github.com/Andy20178/SGFormer.
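
For readers unfamiliar with the R@50 metric cited in the abstract, the following is a minimal sketch of how Recall@K is commonly computed for relationship triples in scene graph generation. It is not taken from the SGFormer code base: the function name `recall_at_k`, the `Triple` layout, and the toy predictions are assumptions made purely for illustration, and the paper's exact evaluation protocol may differ.

```python
from typing import List, Set, Tuple

# Hypothetical triple layout: (subject_id, predicate, object_id).
Triple = Tuple[int, str, int]


def recall_at_k(
    scored: List[Tuple[float, Triple]],
    ground_truth: Set[Triple],
    k: int,
) -> float:
    """Fraction of ground-truth triples found among the k highest-scoring predictions."""
    if not ground_truth:
        return 0.0
    # Rank predictions by confidence and keep only the top k.
    top_k = sorted(scored, key=lambda item: item[0], reverse=True)[:k]
    # Use a set so a triple predicted twice is counted only once.
    hits = {triple for _, triple in top_k if triple in ground_truth}
    return len(hits) / len(ground_truth)


# Toy scene with four scored predictions and two annotated relationships.
predictions = [
    (0.9, (0, "standing on", 1)),
    (0.8, (2, "attached to", 3)),
    (0.4, (0, "left of", 2)),
    (0.1, (1, "supported by", 3)),
]
truth = {(0, "standing on", 1), (1, "supported by", 3)}

print(recall_at_k(predictions, truth, k=2))
print(recall_at_k(predictions, truth, k=4))
```

In this toy scene, only one of the two annotated relationships appears in the top two predictions, while both appear once all four predictions are considered. R@50 applies the same idea with K = 50 predictions per scene.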

Published

2024-03-24

How to Cite

Lv, C., Qi, M., Li, X., Yang, Z., & Ma, H. (2024). SGFormer: Semantic Graph Transformer for Point Cloud-Based 3D Scene Graph Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 38(5), 4035-4043. https://doi.org/10.1609/aaai.v38i5.28197

Issue

Vol. 38 No. 5 (2024)

Section

AAAI Technical Track on Computer Vision IV