Text to Point Cloud Localization with Multi-Level Negative Contrastive Learning

Authors

  • Dunqiang Liu, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, China
  • Shujun Huang, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, China
  • Wen Li, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, China
  • Siqi Shen, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, China
  • Cheng Wang, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, China

DOI:

https://doi.org/10.1609/aaai.v39i5.32574

Abstract

Language-based localization is a crucial task in robotics and computer vision, enabling robots to understand spatial positions through language. Recent methods rely on contrastive learning to establish correspondences between global features of texts and point clouds. However, the inherent ambiguity of textual descriptions makes it difficult to convey geometric information accurately, so forcing texts and point clouds into alignment in the feature space may compromise the expressiveness of the point clouds. Unlike previous methods, this paper proposes using language as a filter to distinguish dissimilar locations. To this end, we propose a robust framework of multi-level negative contrastive learning for language-based localization, fully leveraging the descriptive power of language for spatial localization. Our method learns multiple mismatched factors by minimizing the similarity of different locations at three levels: global, instance, and relation. Extensive experiments conducted on the KITTI360Pose benchmark demonstrate that our method outperforms state-of-the-art methods. Specifically, we achieve a 56.3% improvement in Top-1 retrieval recall and a 45.9% improvement in 5m localization recall.
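
As a rough illustration of the idea of minimizing similarity between mismatched locations at several levels, the sketch below penalizes the cosine similarity of mismatched text and point-cloud feature pairs and sums that penalty over named levels. The abstract does not specify the loss, so everything here is an assumption for illustration: the function names, the hinge form, the `margin` parameter, and the per-level weights are hypothetical and are not taken from the paper.

```python
import numpy as np


def cosine_similarity_matrix(text_feats, cloud_feats):
    """Pairwise cosine similarity between rows of two (N, D) arrays."""
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    c = cloud_feats / np.linalg.norm(cloud_feats, axis=1, keepdims=True)
    return t @ c.T


def negative_contrastive_loss(text_feats, cloud_feats, margin=0.0):
    """Hypothetical negative-only term: row i of each array is assumed to
    describe the same location, so every off-diagonal pair is a mismatch.
    Mismatched pairs are penalized when their similarity exceeds `margin`."""
    sim = cosine_similarity_matrix(text_feats, cloud_feats)
    mismatched = ~np.eye(sim.shape[0], dtype=bool)  # off-diagonal mask
    penalties = np.maximum(sim[mismatched] - margin, 0.0)
    return float(penalties.mean())


def multi_level_negative_loss(levels, weights=None):
    """Weighted sum of the negative term over levels.

    `levels` maps a level name (e.g. "global", "instance", "relation")
    to a (text_feats, cloud_feats) pair; `weights` is an assumed knob."""
    if weights is None:
        weights = {name: 1.0 for name in levels}
    return sum(
        weights[name] * negative_contrastive_loss(t, c)
        for name, (t, c) in levels.items()
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random stand-ins for features that real encoders would produce.
    levels = {
        name: (rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
        for name in ("global", "instance", "relation")
    }
    print(multi_level_negative_loss(levels))
```

In this sketch the matched (diagonal) pairs contribute nothing to the loss, which mirrors the abstract's framing of language as a filter for dissimilar locations rather than a signal to be tightly aligned with geometry; how the paper actually treats matched pairs is not stated in the abstract.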

Published

2025-04-11

How to Cite

Liu, D., Huang, S., Li, W., Shen, S., & Wang, C. (2025). Text to Point Cloud Localization with Multi-Level Negative Contrastive Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 39(5), 5397–5405. https://doi.org/10.1609/aaai.v39i5.32574

Section

AAAI Technical Track on Computer Vision IV