OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model

Xingcheng Zhou; Xuyuan Han; Feng Yang; Yunpu Ma; Volker Tresp; Alois Knoll

doi:10.1609/aaai.v40i16.38386

Authors

Xingcheng Zhou, Technical University of Munich
Xuyuan Han, Technical University of Munich
Feng Yang, Technical University of Munich
Yunpu Ma, Ludwig Maximilian University of Munich
Volker Tresp, Ludwig Maximilian University of Munich
Alois Knoll, Technical University of Munich

DOI: https://doi.org/10.1609/aaai.v40i16.38386

Abstract

We present OpenDriveVLA, a Vision-Language-Action (VLA) model designed for end-to-end autonomous driving, built upon open-source large language models. OpenDriveVLA generates spatially grounded driving actions by leveraging multimodal inputs, including both 2D and 3D instance-aware visual representations, ego vehicle states, and language commands. To bridge the modality gap between driving visual representations and language embeddings, we introduce a hierarchical vision-language alignment process, projecting both 2D and 3D structured visual tokens into a unified semantic space. Furthermore, we incorporate structured agent–environment–ego interaction modeling into the autoregressive decoding process, enabling the model to capture fine-grained spatial dependencies and behavior-aware dynamics critical for reliable trajectory planning. Extensive experiments on the nuScenes dataset demonstrate that OpenDriveVLA achieves state-of-the-art results across open-loop trajectory planning and driving-related question-answering tasks. Qualitative analyses further illustrate its superior capability to follow high-level driving commands and robustly generate trajectories under challenging scenarios, highlighting its potential for next-generation end-to-end autonomous driving.
