LIBA: Language Instructed Multi-granularity Bridge Assistant for 3D Visual Grounding
DOI:
https://doi.org/10.1609/aaai.v39i8.32875
Abstract
3D Visual Grounding (3D-VG) seeks to interpret referential language and identify the referred targets in the 3D physical world. Prevailing methods follow the 2D-VG pipeline, pinpointing the referred object through categorical multi-modal reasoning. However, the geometric complexity of 3D scenes and the nuanced syntactic structure of language exacerbate the granularity inconsistency between point cloud and text features, hindering the development of 3D-VG systems in complex scenarios. To address this issue, we propose LIBA, a Language-Instructed multi-granularity Bridge Assistant tailored for the 3D-VG task. LIBA addresses two questions. (1) How to establish multi-granularity 3D vision-text feature alignment in a unified model? We introduce a bilateral Dynamic Bridge Adapter (DBA) to build multi-granularity interaction between the 3D vision and language backbones during feature extraction. We further develop the Language-aware Cross-scale Object Modulation (LCOM) module to integrate multi-scale point cloud features modulated by language information. (2) After aligning multi-modal features, how to fully harness the language model's knowledge to bolster the understanding of visual concepts? An LLM-guided Hierarchical Query Selection (LLM-HQS) module incorporates the world knowledge of a Large Language Model (LLM) to ground the referred target via an Attribute-then-Relation reasoning process. In this manner, LIBA inherits the reasoning ability and world knowledge of the LLM to bridge point clouds and text at multiple granularities. Experiments on the ScanRefer and Nr3D/Sr3D benchmarks show that LIBA outperforms state-of-the-art methods by a considerable margin.
Published
2025-04-11
How to Cite
Wang, Y., Li, Y.-L., Y, W. U. E. Z., & Wang, S. (2025). LIBA: Language Instructed Multi-granularity Bridge Assistant for 3D Visual Grounding. Proceedings of the AAAI Conference on Artificial Intelligence, 39(8), 8114–8122. https://doi.org/10.1609/aaai.v39i8.32875
Issue
Section
AAAI Technical Track on Computer Vision VII