Text to Point Cloud Localization with Multi-Level Negative Contrastive Learning

Dunqiang Liu; Shujun Huang; Wen Li; Siqi Shen; Cheng Wang

doi:10.1609/aaai.v39i5.32574

Authors

Dunqiang Liu: Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, China
Shujun Huang: Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, China
Wen Li: Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, China
Siqi Shen: Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, China
Cheng Wang: Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, China

Abstract

Language-based localization is a crucial task in robotics and computer vision, enabling robots to understand spatial positions through language. Recent methods rely on contrastive learning to establish correspondences between global features of texts and point clouds. However, the inherent ambiguity of textual descriptions makes it difficult to convey geometric information accurately, so forcing texts and point clouds into alignment in the feature space may compromise the expressiveness of the point clouds. Unlike previous methods, this paper proposes using language as a filter to distinguish dissimilar locations. To this end, we propose a robust framework of multi-level negative contrastive learning for language-based localization, fully leveraging the descriptive power of language for spatial localization. Our method learns multiple mismatched factors by minimizing the similarity of different locations at three levels: global, instance, and relation. Extensive experiments on the KITTI360Pose benchmark demonstrate that our method outperforms state-of-the-art methods. Specifically, we achieve a 56.3% improvement in Top-1 retrieval recall and a 45.9% improvement in 5m localization recall.
