LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding

Authors

  • Senqiao Yang (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
  • Jiaming Liu (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; AI2 Robotics)
  • Renrui Zhang (The Chinese University of Hong Kong)
  • Mingjie Pan (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
  • Ziyu Guo (The Chinese University of Hong Kong)
  • Xiaoqi Li (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
  • Zehui Chen (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
  • Peng Gao (Shanghai Artificial Intelligence Laboratory)
  • Hongsheng Li (The Chinese University of Hong Kong)
  • Yandong Guo (AI2 Robotics)
  • Shanghang Zhang (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)

DOI

https://doi.org/10.1609/aaai.v39i9.33001

Abstract

Recently, Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have shown promise in instruction following and image understanding. Powerful as they are, these models have not yet been extended to comprehend more challenging 3D geometric and physical scenes, especially sparse outdoor LiDAR data. In this paper, we introduce LiDAR-LLM, which takes raw LiDAR data as input and harnesses the remarkable reasoning capabilities of LLMs to gain a comprehensive understanding of outdoor 3D scenes. The central insight of LiDAR-LLM is the reformulation of 3D outdoor scene cognition as a language modeling problem, encompassing tasks such as 3D captioning, 3D grounding, and 3D question answering. Specifically, because paired 3D LiDAR-text data are scarce, we introduce a three-stage training strategy and generate the relevant datasets, progressively aligning the 3D modality with the language embedding space of the LLM. Furthermore, we design a Position-Aware Transformer (PAT) to connect the 3D encoder with the LLM, which effectively bridges the modality gap and enhances the LLM's comprehension of the spatial orientation of visual features. Our experiments demonstrate that LiDAR-LLM effectively handles a wide range of instructions related to 3D scenes, achieving a 40.9 BLEU-1 score on the 3D captioning dataset, a Grounded Captioning accuracy of 63.1%, and a BEV mIoU of 14.3%.
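The abstract describes a connector module, the Position-Aware Transformer (PAT), that maps 3D encoder features into the LLM's embedding space while preserving spatial layout. The sketch below illustrates one plausible form of such a module: learnable queries cross-attend to position-embedded bird's-eye-view (BEV) features and are projected to LLM token embeddings. This is a minimal reading of the abstract, not the authors' released implementation; all names and dimensions (PositionAwareTransformer, bev_dim, llm_dim, num_queries, bev_hw) are illustrative assumptions.

```python
# Hedged sketch of a position-aware 3D-to-LLM connector, assuming BEV-style
# LiDAR features. Names and sizes are assumptions, not the paper's code.
import torch
import torch.nn as nn

class PositionAwareTransformer(nn.Module):
    def __init__(self, bev_dim=256, llm_dim=4096, num_queries=32,
                 num_heads=8, bev_hw=(32, 32)):
        super().__init__()
        h, w = bev_hw
        # Learnable queries that become the "LiDAR tokens" fed to the LLM.
        self.queries = nn.Parameter(torch.randn(num_queries, bev_dim) * 0.02)
        # Learnable positional embedding over the flattened BEV grid, so
        # attention retains the spatial orientation of the scene.
        self.bev_pos = nn.Parameter(torch.randn(h * w, bev_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(bev_dim, num_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(bev_dim)
        # Project the attended queries into the LLM's embedding space.
        self.to_llm = nn.Linear(bev_dim, llm_dim)

    def forward(self, bev_feats):
        # bev_feats: (B, C, H, W) features from the 3D/LiDAR encoder.
        b, c, h, w = bev_feats.shape
        kv = bev_feats.flatten(2).transpose(1, 2) + self.bev_pos  # (B, H*W, C)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)           # (B, Nq, C)
        attended, _ = self.cross_attn(q, kv, kv)
        return self.to_llm(self.norm(q + attended))               # (B, Nq, llm_dim)

# Usage: PositionAwareTransformer()(torch.randn(2, 256, 32, 32)) yields a
# (2, 32, 4096) tensor of tokens that can be prepended to text embeddings.
```

Under this reading, the fixed number of queries keeps the LLM's visual context short regardless of point-cloud size, while the BEV positional embedding is what gives the connector its "position-aware" character.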

Published

2025-04-11

How to Cite

Yang, S., Liu, J., Zhang, R., Pan, M., Guo, Z., Li, X., … Zhang, S. (2025). LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 39(9), 9247–9255. https://doi.org/10.1609/aaai.v39i9.33001

Section

AAAI Technical Track on Computer Vision VIII