LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding

Authors

  • Senqiao Yang (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
  • Jiaming Liu (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; AI2 Robotics)
  • Renrui Zhang (The Chinese University of Hong Kong)
  • Mingjie Pan (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
  • Ziyu Guo (The Chinese University of Hong Kong)
  • Xiaoqi Li (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
  • Zehui Chen (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
  • Peng Gao (Shanghai Artificial Intelligence Laboratory)
  • Hongsheng Li (The Chinese University of Hong Kong)
  • Yandong Guo (AI2 Robotics)
  • Shanghang Zhang (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)

DOI

https://doi.org/10.1609/aaai.v39i9.33001

Abstract

Recently, Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have shown promise in instruction following and image understanding. Powerful as they are, these models have not yet been extended to comprehend more challenging 3D geometric and physical scenes, especially sparse outdoor LiDAR data. In this paper, we introduce LiDAR-LLM, which takes raw LiDAR data as input and harnesses the remarkable reasoning capabilities of LLMs to gain a comprehensive understanding of outdoor 3D scenes. The central insight of LiDAR-LLM is the reformulation of 3D outdoor scene cognition as a language modeling problem, encompassing tasks such as 3D captioning, 3D grounding, and 3D question answering. Specifically, because paired 3D LiDAR-text data are scarce, we introduce a three-stage training strategy and generate the relevant datasets, progressively aligning the 3D modality with the language embedding space of the LLM. Furthermore, we design a Position-Aware Transformer (PAT) to connect the 3D encoder with the LLM, which effectively bridges the modality gap and enhances the LLM's comprehension of the spatial orientation of visual features. Our experiments demonstrate that LiDAR-LLM effectively handles a wide range of instructions related to 3D scenes, achieving a 40.9 BLEU-1 score on the 3D captioning dataset, a Grounded Captioning accuracy of 63.1%, and a BEV mIoU of 14.3%.
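The abstract describes a connector module, the Position-Aware Transformer (PAT), that maps 3D encoder features into the LLM's embedding space while preserving spatial layout. The sketch below illustrates one plausible form of such a module: learnable queries cross-attend to position-embedded bird's-eye-view (BEV) features and are projected to LLM token embeddings. This is a minimal reading of the abstract, not the authors' released implementation; all names and dimensions (PositionAwareTransformer, bev_dim, llm_dim, num_queries, bev_hw) are illustrative assumptions.

```python
# Hedged sketch of a position-aware 3D-to-LLM connector, assuming BEV-style
# LiDAR features. Names and sizes are assumptions, not the paper's code.
import torch
import torch.nn as nn

class PositionAwareTransformer(nn.Module):
    def __init__(self, bev_dim=256, llm_dim=4096, num_queries=32,
                 num_heads=8, bev_hw=(32, 32)):
        super().__init__()
        h, w = bev_hw
        # Learnable queries that become the "LiDAR tokens" fed to the LLM.
        self.queries = nn.Parameter(torch.randn(num_queries, bev_dim) * 0.02)
        # Learnable positional embedding over the flattened BEV grid, so
        # attention retains the spatial orientation of the scene.
        self.bev_pos = nn.Parameter(torch.randn(h * w, bev_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(bev_dim, num_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(bev_dim)
        # Project the attended queries into the LLM's embedding space.
        self.to_llm = nn.Linear(bev_dim, llm_dim)

    def forward(self, bev_feats):
        # bev_feats: (B, C, H, W) features from the 3D/LiDAR encoder.
        b, c, h, w = bev_feats.shape
        kv = bev_feats.flatten(2).transpose(1, 2) + self.bev_pos  # (B, H*W, C)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)           # (B, Nq, C)
        attended, _ = self.cross_attn(q, kv, kv)
        return self.to_llm(self.norm(q + attended))               # (B, Nq, llm_dim)

# Usage: PositionAwareTransformer()(torch.randn(2, 256, 32, 32)) yields a
# (2, 32, 4096) tensor of tokens that can be prepended to text embeddings.
```

Under this reading, the fixed number of queries keeps the LLM's visual context short regardless of point-cloud size, while the BEV positional embedding is what gives the connector its "position-aware" character.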

Published

2025-04-11

How to Cite

Yang, S., Liu, J., Zhang, R., Pan, M., Guo, Z., Li, X., … Zhang, S. (2025). LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 39(9), 9247–9255. https://doi.org/10.1609/aaai.v39i9.33001

Section

AAAI Technical Track on Computer Vision VIII