LIBA: Language Instructed Multi-granularity Bridge Assistant for 3D Visual Grounding

Yuan Wang; Ya-Li Li; W U Eastman Z Y; Shengjin Wang

doi:10.1609/aaai.v39i8.32875

Authors

Yuan Wang: Department of Electronic Engineering, Tsinghua University, China; Beijing National Research Center for Information Science and Technology (BNRist), China
Ya-Li Li: Department of Electronic Engineering, Tsinghua University, China; Beijing National Research Center for Information Science and Technology (BNRist), China
W U Eastman Z Y: Department of Electronic Engineering, Tsinghua University, China; Beijing National Research Center for Information Science and Technology (BNRist), China
Shengjin Wang: Department of Electronic Engineering, Tsinghua University, China; Beijing National Research Center for Information Science and Technology (BNRist), China

DOI:

https://doi.org/10.1609/aaai.v39i8.32875

Abstract

3D Visual Grounding (3D-VG) seeks to interpret referential language and identify the referred targets in the 3D physical world. Prevailing methods follow the 2D-VG pipeline, pinpointing the referred object through categorical multi-modal reasoning. However, the geometric complexity of 3D scenes and the nuanced syntactic structure of language exacerbate the granularity inconsistency between point cloud and text features, hindering the development of 3D-VG systems in complex scenarios. To address this issue, we propose LIBA, a Language-Instructed multi-granularity Bridge Assistant tailored for the 3D-VG task. LIBA tackles the problem in two steps. (1) How can multi-granularity 3D vision-text feature alignment be established in a unified model? We propose a bilateral Dynamic Bridge Adapter (DBA) to build multi-granularity interaction between the 3D vision and language backbones during feature extraction. We further develop the Language-aware Cross-scale Object Modulation (LCOM) module to integrate multi-scale point cloud features modulated by language information. (2) After aligning the multi-modal features, how can the language model's knowledge be fully harnessed to bolster the understanding of visual concepts? An LLM-guided Hierarchical Query Selection (LLM-HQS) module incorporates the world knowledge of a Large Language Model (LLM) to ground the target referral via an Attribute-then-Relation reasoning process. In this manner, LIBA inherits the reasoning prowess and world knowledge of the LLM to bridge point clouds and text at multiple granularities. Experiments on the ScanRefer and Nr3D/Sr3D benchmarks substantiate the superiority of LIBA, which outperforms state-of-the-art methods by a considerable margin.
