Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling

Hao Li; Shuai Yang; Yilun Chen; Xinyi Chen; Xiaoda Yang; Yang Tian; Hanqing Wang; Tai Wang; Dahua Lin; Feng Zhao; Jiangmiao Pang

doi:10.1609/aaai.v40i22.38903

Authors

Hao Li, University of Science and Technology of China; Shanghai Artificial Intelligence Laboratory
Shuai Yang, Shanghai Artificial Intelligence Laboratory; Zhejiang University
Yilun Chen, Shanghai Artificial Intelligence Laboratory
Xinyi Chen, Shanghai Artificial Intelligence Laboratory
Xiaoda Yang, Zhejiang University
Yang Tian, Shanghai Artificial Intelligence Laboratory
Hanqing Wang, Shanghai Artificial Intelligence Laboratory
Tai Wang, Shanghai Artificial Intelligence Laboratory
Dahua Lin, The Chinese University of Hong Kong
Feng Zhao, University of Science and Technology of China
Jiangmiao Pang, Shanghai Artificial Intelligence Laboratory

DOI:

https://doi.org/10.1609/aaai.v40i22.38903

Abstract

Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong performance in robotic manipulation. However, these models remain constrained by the single-frame image paradigm and fail to fully leverage the temporal information offered by multi-frame histories, as directly feeding multiple frames into VLM backbones incurs substantial computational overhead and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame pretraining on large-scale embodied datasets with autoregressive prediction of action tokens, establishing an effective embodied vision-language foundation; (2) Multi-frame post-training, which adapts the prediction of the vision-language backbone from discrete tokens to learnable features, and aggregates historical information via feature chunking. CronusVLA effectively addresses the existing challenges of multi-frame modeling while enhancing performance. To evaluate the robustness under temporal and spatial disturbances, we introduce SimplerEnv-OR, a novel benchmark featuring 24 types of observational disturbances and 120 severity levels. Experiments across three embodiments in simulated and real-world environments demonstrate that CronusVLA achieves leading performance and superior robustness, with a 70.9% success rate on SimplerEnv, a 26.8% improvement over OpenVLA on LIBERO, and the highest robustness score on SimplerEnv-OR, showing the promise of efficient multi-frame adaptation for real-world VLA deployment.
