Generative Planning with 3D-Vision Language Pre-training for End-to-End Autonomous Driving

Authors

  • Tengpeng Li, Tongji University
  • Hanli Wang, Tongji University
  • Xianfei Li, COWAROBOT
  • Wenlong Liao, COWAROBOT
  • Tao He, COWAROBOT and University of South China
  • Pai Peng, COWAROBOT

DOI:

https://doi.org/10.1609/aaai.v39i5.32524

Abstract

Autonomous driving is a challenging task that requires perceiving and understanding the surrounding environment for safe trajectory planning. While existing vision-based end-to-end models have achieved promising results, these methods still face challenges in visual understanding, decision reasoning, and scene generalization. To address these issues, a generative planning model with 3D-vision language pre-training, named GPVL, is proposed for end-to-end autonomous driving. The proposed paradigm has two significant aspects. On one hand, a 3D-vision language pre-training module is designed to bridge the gap between visual perception and linguistic understanding in the bird's eye view. On the other hand, a cross-modal language model is introduced to generate reasonable plans from perception and navigation information in an auto-regressive manner. Experiments on the challenging nuScenes dataset demonstrate that the proposed scheme achieves excellent performance compared with state-of-the-art methods. In addition, GPVL presents strong generalization ability and real-time potential when handling high-level commands in various scenarios. It is believed that the effective, robust, and efficient performance of GPVL is crucial for the practical application of future autonomous driving systems.
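The auto-regressive planning idea from the abstract can be caricatured in a few lines: waypoints are emitted one at a time, each conditioned on the previously generated state and a high-level navigation command. The sketch below is a minimal toy illustration under stated assumptions; the function name `plan_trajectory`, the command set, and the fixed turn rates are all hypothetical placeholders, whereas GPVL learns this conditioning from 3D-vision language pre-trained BEV features.

```python
import math
from typing import List, Tuple

# Illustrative assumption: a fixed heading change (radians) per decoding step
# for each high-level command. GPVL instead predicts waypoints with a learned
# cross-modal language model; nothing here reflects its actual parameters.
COMMAND_TURN_RATE = {"go straight": 0.0, "turn left": 0.15, "turn right": -0.15}

def plan_trajectory(command: str, steps: int = 6,
                    speed: float = 2.0) -> List[Tuple[float, float]]:
    """Emit waypoints auto-regressively: each step depends on the state
    produced by the previous step plus the navigation command."""
    x, y, heading = 0.0, 0.0, math.pi / 2  # start at the origin, facing +y
    waypoints = []
    for _ in range(steps):
        heading += COMMAND_TURN_RATE[command]  # command conditions every step
        x += speed * math.cos(heading)
        y += speed * math.sin(heading)
        waypoints.append((round(x, 3), round(y, 3)))
    return waypoints
```

For example, `plan_trajectory("turn left")` yields six waypoints that curve toward negative x, while `"go straight"` advances straight along +y; in the real model the analogue of each loop iteration is one decoding step of the language model over perception and navigation tokens.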

Published

2025-04-11

How to Cite

Li, T., Wang, H., Li, X., Liao, W., He, T., & Peng, P. (2025). Generative Planning with 3D-Vision Language Pre-training for End-to-End Autonomous Driving. Proceedings of the AAAI Conference on Artificial Intelligence, 39(5), 4950–4958. https://doi.org/10.1609/aaai.v39i5.32524

Section

AAAI Technical Track on Computer Vision IV