10 Open Challenges Steering the Future of Vision-Language-Action Models

Soujanya Poria; Navonil Majumder; Chia-Yu Hung; Amir Ali Bagherzadeh; Chuan Li; Kenneth Kwok; Ziwei Wang; Cheston Tan; Jiajun Wu; David Hsu

doi:10.1609/aaai.v40i46.41333

Authors

Soujanya Poria Nanyang Technological University
Navonil Majumder SUTD
Chia-Yu Hung Nanyang Technological University
Amir Ali Bagherzadeh Lambda Labs
Chuan Li Lambda Labs
Kenneth Kwok A*STAR
Ziwei Wang Nanyang Technological University
Cheston Tan A*STAR
Jiajun Wu Stanford University
David Hsu National University of Singapore

DOI:

https://doi.org/10.1609/aaai.v40i46.41333

Abstract

Owing to their ability to follow natural language instructions, vision-language-action (VLA) models are increasingly prevalent in the embodied AI arena, following the widespread success of their precursors, LLMs and VLMs. In this paper, we discuss 10 principal milestones in the ongoing development of VLA models: multimodality, reasoning, data, evaluation, cross-robot action generalization, efficiency, whole-body coordination, safety, agents, and coordination with humans. Furthermore, we discuss the emerging trends of using spatial understanding, modeling world dynamics, post-training, and data synthesis, all aiming to reach these milestones. Through these discussions, we hope to bring attention to the research avenues that may accelerate the development of VLA models toward wider acceptability.

Published

2026-03-14

How to Cite

Poria, S., Majumder, N., Hung, C.-Y., Bagherzadeh, A. A., Li, C., Kwok, K., … Hsu, D. (2026). 10 Open Challenges Steering the Future of Vision-Language-Action Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(46), 39771–39779. https://doi.org/10.1609/aaai.v40i46.41333

Issue

Vol. 40 No. 46: AAAI-26 Special Track AI for Social Impact II and Senior Member Presentations

Section

Senior Member Presentation
