OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model
DOI:
https://doi.org/10.1609/aaai.v40i16.38386
Abstract
We present OpenDriveVLA, a Vision-Language Action (VLA) model designed for end-to-end autonomous driving, built upon open-source large language models. OpenDriveVLA generates spatially-grounded driving actions by leveraging multimodal inputs, including both 2D and 3D instance-aware visual representations, ego vehicle states, and language commands. To bridge the modality gap between driving visual representations and language embeddings, we introduce a hierarchical vision-language alignment process, projecting both 2D and 3D structured visual tokens into a unified semantic space. Furthermore, we incorporate structured agent–environment–ego interaction modeling into the autoregressive decoding process, enabling the model to capture fine-grained spatial dependencies and behavior-aware dynamics critical for reliable trajectory planning. Extensive experiments on the nuScenes dataset demonstrate that OpenDriveVLA achieves state-of-the-art results across open-loop trajectory planning and driving-related question-answering tasks. Qualitative analyses further illustrate its superior capability to follow high-level driving commands and robustly generate trajectories under challenging scenarios, highlighting its potential for next-generation end-to-end autonomous driving.
Published
2026-03-14
How to Cite
Zhou, X., Han, X., Yang, F., Ma, Y., Tresp, V., & Knoll, A. (2026). OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13782–13790. https://doi.org/10.1609/aaai.v40i16.38386
Issue
Section
AAAI Technical Track on Computer Vision XIII