X-RefSeg3D: Enhancing Referring 3D Instance Segmentation via Structured Cross-Modal Graph Neural Networks
DOI:
https://doi.org/10.1609/aaai.v38i5.28254
Keywords:
CV: 3D Computer Vision, CV: Language and Vision
Abstract
Referring 3D instance segmentation is a challenging task that aims to accurately segment a target instance within a 3D scene based on a given referring expression. However, previous methods have overlooked the distinct roles played by different words in referring expressions. They have also failed to align the positional relationships described in referring expressions with the spatial correlations present in 3D scenes. To alleviate these issues, we present a novel model called X-RefSeg3D, which constructs a cross-modal graph for the input 3D scene and unites textual and spatial relationships for reasoning via graph neural networks. Our approach begins by capturing object-specific text features, which are then fused with the instance features to construct a comprehensive cross-modal scene graph. Subsequently, we feed the resulting cross-modal features into graph neural networks, leveraging the K-nearest neighbor algorithm to derive explicit instructions from the expressions and the factual relationships in scenes. This enables the effective capture of higher-order relationships among instances, thereby enhancing feature fusion and facilitating reasoning. Finally, the refined features are passed through a matching module to compute the ultimate matching score. Experimental results on ScanRefer demonstrate the effectiveness of our method, which surpasses previous approaches by a substantial margin of +3.67% in terms of mIoU.
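To make the described pipeline concrete, below is a minimal sketch, assuming PyTorch, of the two central steps the abstract names: building K-nearest-neighbor edges over fused cross-modal instance features, running one round of graph message passing, and scoring the refined instances against the expression feature. All names here (CrossModalGraphLayer, matching_scores, and the shapes of the fused inputs) are hypothetical illustrations for exposition, not the authors' released implementation.

```python
# Minimal sketch of KNN-based cross-modal graph reasoning (hypothetical, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalGraphLayer(nn.Module):
    """One round of message passing over a K-nearest-neighbor scene graph."""
    def __init__(self, dim: int, k: int = 8):
        super().__init__()
        self.k = k
        self.msg = nn.Linear(2 * dim, dim)  # message computed from (node, neighbor) pairs

    def forward(self, feats: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
        # feats:   (N, D) cross-modal instance features (text fused with visual)
        # centers: (N, 3) instance centroids used to build spatial KNN edges
        dist = torch.cdist(centers, centers)                         # (N, N) pairwise distances
        knn = dist.topk(self.k + 1, largest=False).indices[:, 1:]   # nearest neighbors, self-loop dropped
        neigh = feats[knn]                                           # (N, K, D) neighbor features
        self_feats = feats.unsqueeze(1).expand_as(neigh)             # (N, K, D) broadcast node features
        msgs = self.msg(torch.cat([self_feats, neigh], dim=-1))     # (N, K, D) edge messages
        return feats + F.relu(msgs.mean(dim=1))                      # residual aggregation per node

def matching_scores(inst_feats: torch.Tensor, sent_feat: torch.Tensor) -> torch.Tensor:
    # Score each refined instance feature against the sentence-level expression feature.
    return F.cosine_similarity(inst_feats, sent_feat.unsqueeze(0), dim=-1)

# Example: 16 instances with 256-D fused features in one scene.
layer = CrossModalGraphLayer(dim=256, k=8)
refined = layer(torch.randn(16, 256), torch.randn(16, 3))
scores = matching_scores(refined, torch.randn(256))  # (16,) matching scores
```

In this reading, the KNN edges supply the spatial correlations of the 3D scene while the fused node features carry the textual cues, so repeated message passing can propagate higher-order relational evidence among instances before the final matching step selects the referred target.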
Published
2024-03-24
How to Cite
Qian, Z., Ma, Y., Ji, J., & Sun, X. (2024). X-RefSeg3D: Enhancing Referring 3D Instance Segmentation via Structured Cross-Modal Graph Neural Networks. Proceedings of the AAAI Conference on Artificial Intelligence, 38(5), 4551-4559. https://doi.org/10.1609/aaai.v38i5.28254
Section
AAAI Technical Track on Computer Vision IV