LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding
DOI:
https://doi.org/10.1609/aaai.v39i9.33001

Abstract
Recently, Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have shown promise in instruction following and image understanding. While these models are powerful, they have not yet been developed to comprehend the more challenging 3D geometric and physical scenes, especially sparse outdoor LiDAR data. In this paper, we introduce LiDAR-LLM, which takes raw LiDAR data as input and harnesses the remarkable reasoning capabilities of LLMs to gain a comprehensive understanding of outdoor 3D scenes. The central insight of LiDAR-LLM is the reformulation of 3D outdoor scene cognition as a language modeling problem, encompassing tasks such as 3D captioning, 3D grounding, and 3D question answering. Specifically, because 3D LiDAR-text paired data are scarce, we introduce a three-stage training strategy and generate the relevant datasets, progressively aligning the 3D modality with the language embedding space of the LLM. Furthermore, we design a Position-Aware Transformer (PAT) to connect the 3D encoder with the LLM, which effectively bridges the modality gap and enhances the LLM's comprehension of the spatial orientation of visual features. Our experiments demonstrate that LiDAR-LLM effectively comprehends a wide range of instructions related to 3D scenes, achieving a 40.9 BLEU-1 score on the 3D captioning dataset, a Grounded Captioning accuracy of 63.1%, and a BEV mIoU of 14.3%.
Published
2025-04-11
How to Cite
Yang, S., Liu, J., Zhang, R., Pan, M., Guo, Z., Li, X., … Zhang, S. (2025). LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 39(9), 9247–9255. https://doi.org/10.1609/aaai.v39i9.33001
Section
AAAI Technical Track on Computer Vision VIII