Cross-Modal Match for Language Conditioned 3D Object Grounding

Yachao Zhang; Runze Hu; Ronghui Li; Yanyun Qu; Yuan Xie; Xiu Li

doi:10.1609/aaai.v38i7.28566

Authors

Yachao Zhang, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China
Runze Hu, School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
Ronghui Li, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China
Yanyun Qu, School of Informatics, Xiamen University, Xiamen 361000, China
Yuan Xie, School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
Xiu Li, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China

DOI:

https://doi.org/10.1609/aaai.v38i7.28566

Keywords:

CV: 3D Computer Vision, CV: Language and Vision, CV: Multi-modal Vision, CV: Scene Analysis & Understanding

Abstract

Language conditioned 3D object grounding aims to find the object in a 3D scene that a natural language description refers to, which depends mainly on matching between visual and natural language representations. Considerable improvements in grounding performance have been achieved by improving the multimodal fusion mechanism or by bridging the gap between detection and matching. However, several mismatches have been ignored, namely the mismatch between local visual representations and the global sentence representation, and the mismatch between the visual space and the corresponding label word space. In this paper, we propose cross-modal match for 3D grounding from the perspective of mitigating these mismatches. Specifically, to match local visual features with the global description sentence, we propose a BEV (bird's-eye-view) based global information embedding module. It projects multiple object proposal features into the BEV, and the relations among different objects are captured by a visual transformer, which can model both positions and features with long-range dependencies. To circumvent the mismatch between the feature spaces of different modalities, we propose cross-modal consistency learning. It applies cross-modal consistency constraints to convert the visual feature space into the label word feature space, which makes matching easier. In addition, we introduce a label distillation loss and a global distillation loss to drive the learning of these matches in a distillation manner. We evaluate our method under mainstream evaluation settings on three datasets, and the results demonstrate the effectiveness of the proposed method.
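
As a rough illustration of two ideas mentioned in the abstract, the sketch below scatters per-proposal features onto a bird's-eye-view grid and computes a simple cosine-based consistency penalty between visual features and label word embeddings. It is not the authors' implementation: the function names, the grid size, the scene extent, the averaging of proposals that share a cell, and the cosine form of the penalty are all assumptions made for this example, and the visual transformer and distillation losses are omitted.

```python
import numpy as np


def scatter_to_bev(centers, feats, grid_size=8, extent=4.0):
    """Place per-proposal features on a BEV grid (hypothetical layout).

    centers: (N, 3) proposal centers, scene assumed centred at the origin
             and covering [-extent, extent] along x and y.
    feats:   (N, C) proposal features.
    Returns a (grid_size, grid_size, C) array. Proposals that fall in the
    same cell are averaged (an assumption); empty cells stay zero.
    """
    centers = np.asarray(centers, dtype=np.float64)
    feats = np.asarray(feats, dtype=np.float64)
    cell = 2.0 * extent / grid_size
    # Drop the height (z) axis and quantise x, y into cell indices.
    idx = np.floor((centers[:, :2] + extent) / cell).astype(int)
    idx = np.clip(idx, 0, grid_size - 1)
    bev = np.zeros((grid_size, grid_size, feats.shape[1]))
    count = np.zeros((grid_size, grid_size, 1))
    # np.add.at accumulates correctly when several proposals share a cell.
    np.add.at(bev, (idx[:, 0], idx[:, 1]), feats)
    np.add.at(count, (idx[:, 0], idx[:, 1]), 1.0)
    return bev / np.maximum(count, 1.0)


def cosine_consistency_loss(visual, label_words):
    """Mean (1 - cosine similarity) between paired rows.

    A generic stand-in for a cross-modal consistency constraint; the
    paper's actual loss may differ.
    """
    visual = np.asarray(visual, dtype=np.float64)
    label_words = np.asarray(label_words, dtype=np.float64)
    v = visual / (np.linalg.norm(visual, axis=1, keepdims=True) + 1e-8)
    w = label_words / (np.linalg.norm(label_words, axis=1, keepdims=True) + 1e-8)
    return float(np.mean(1.0 - np.sum(v * w, axis=1)))


if __name__ == "__main__":
    # Toy data: three proposals with 2-dimensional features.
    centers = [[-3.5, -3.5, 0.2], [-3.4, -3.6, 1.0], [3.9, 0.0, 0.5]]
    feats = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]]
    bev = scatter_to_bev(centers, feats)
    print(bev.shape)
    print(cosine_consistency_loss(feats, feats))
```

In a full pipeline, the resulting grid would be the input on which a transformer relates objects by position and feature; here it only shows how local proposal features can be arranged into a single global, spatially indexed representation.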
