Text to Point Cloud Localization with Relation-Enhanced Transformer

Authors

  • Guangzhi Wang National University of Singapore
  • Hehe Fan National University of Singapore
  • Mohan Kankanhalli National University of Singapore

DOI:

https://doi.org/10.1609/aaai.v37i2.25347

Keywords:

CV: 3D Computer Vision, CV: Multi-modal Vision

Abstract

Automatically localizing a position based on a few natural language instructions is essential for future robots to communicate and collaborate with humans. To approach this goal, we focus on a text-to-point-cloud cross-modal localization problem: given a textual query, the task is to identify the described location in city-scale point clouds. The task involves two challenges. 1) In city-scale point clouds, similar ambient instances may exist in several locations. Searching each location in a huge point cloud with only instances as guidance may lead to less discriminative signals and incorrect results. 2) In textual descriptions, the hints are provided separately. The relations among those hints are therefore not explicitly described, leaving the agent to learn these relations on its own. To address these two challenges, we propose a unified Relation-Enhanced Transformer (RET) to improve representation discriminability for both point clouds and natural language queries. The core of the proposed RET is a novel Relation-enhanced Self-Attention (RSA) mechanism, which explicitly encodes instance (hint)-wise relations for the two modalities. Moreover, we propose a fine-grained cross-modal matching method to further refine the location predictions in a subsequent instance-hint matching stage. Experimental results on the KITTI360Pose dataset demonstrate that our approach surpasses the previous state-of-the-art method by large margins.
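
As a rough illustration of what it can mean for self-attention to encode instance-wise relations explicitly, the sketch below adds a pairwise geometric bias to ordinary scaled dot-product attention scores. It is a generic relation-biased attention under stated assumptions, not the RSA formulation from the paper: the function name `relation_biased_attention`, the use of negative Euclidean distance between instance centers as the relation term, the `weight` parameter, and the absence of learned projections are all hypothetical choices made for brevity.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def relation_biased_attention(features, centers, weight=1.0):
    """Self-attention over n instances with an explicit pairwise relation term.

    features: list of n feature vectors (each a list of d floats)
    centers:  list of n instance centers (tuples of coordinates)
    weight:   strength of the relation bias (hypothetical parameter)

    Assumption: queries, keys and values are the raw features (no learned
    projections), and the relation between instances i and j is modelled
    as the negative Euclidean distance between their centers.
    """
    n = len(features)
    d = len(features[0])
    scale = math.sqrt(d)
    attention, outputs = [], []
    for i in range(n):
        scores = []
        for j in range(n):
            # Content term: scaled dot product between instance features.
            content = sum(a * b for a, b in zip(features[i], features[j])) / scale
            # Relation term: nearby instances receive a larger score.
            relation = -weight * math.dist(centers[i], centers[j])
            scores.append(content + relation)
        weights = softmax(scores)
        attention.append(weights)
        # Output for instance i: attention-weighted sum of all features.
        outputs.append(
            [sum(weights[j] * features[j][k] for j in range(n)) for k in range(d)]
        )
    return outputs, attention
```

With identical features for every instance, the content term is constant, so the attention weights are driven purely by the relation term: nearby instances receive more weight than distant ones, and setting `weight=0.0` recovers uniform attention. This is the sense in which an explicit relation term can separate locations whose instances look alike but are arranged differently.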

Published

2023-06-26

How to Cite

Wang, G., Fan, H., & Kankanhalli, M. (2023). Text to Point Cloud Localization with Relation-Enhanced Transformer. Proceedings of the AAAI Conference on Artificial Intelligence, 37(2), 2501-2509. https://doi.org/10.1609/aaai.v37i2.25347

Section

AAAI Technical Track on Computer Vision II