Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling

Authors

  • Hao Li, University of Science and Technology of China; Shanghai Artificial Intelligence Laboratory
  • Shuai Yang, Shanghai Artificial Intelligence Laboratory; Zhejiang University
  • Yilun Chen, Shanghai Artificial Intelligence Laboratory
  • Xinyi Chen, Shanghai Artificial Intelligence Laboratory
  • Xiaoda Yang, Zhejiang University
  • Yang Tian, Shanghai Artificial Intelligence Laboratory
  • Hanqing Wang, Shanghai Artificial Intelligence Laboratory
  • Tai Wang, Shanghai Artificial Intelligence Laboratory
  • Dahua Lin, The Chinese University of Hong Kong
  • Feng Zhao, University of Science and Technology of China
  • Jiangmiao Pang, Shanghai Artificial Intelligence Laboratory

DOI:

https://doi.org/10.1609/aaai.v40i22.38903

Abstract

Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong performance in robotic manipulation. However, these models remain constrained by the single-frame image paradigm and fail to fully leverage the temporal information offered by multi-frame histories, as directly feeding multiple frames into VLM backbones incurs substantial computational overhead and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame pretraining on large-scale embodied datasets with autoregressive prediction of action tokens, establishing an effective embodied vision-language foundation; (2) Multi-frame post-training, which adapts the prediction of the vision-language backbone from discrete tokens to learnable features, and aggregates historical information via feature chunking. CronusVLA effectively addresses the existing challenges of multi-frame modeling while enhancing performance. To evaluate the robustness under temporal and spatial disturbances, we introduce SimplerEnv-OR, a novel benchmark featuring 24 types of observational disturbances and 120 severity levels. Experiments across three embodiments in simulated and real-world environments demonstrate that CronusVLA achieves leading performance and superior robustness, with a 70.9% success rate on SimplerEnv, a 26.8% improvement over OpenVLA on LIBERO, and the highest robustness score on SimplerEnv-OR, showing the promise of efficient multi-frame adaptation for real-world VLA deployment.
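
The abstract's efficiency argument can be illustrated with a minimal sketch. It assumes that "feature chunking" amounts to encoding each frame once with the single-frame backbone and keeping the resulting features in a fixed-length history that is handed to the action predictor. The abstract does not give the actual CronusVLA mechanism, so this is an illustration only: `FeatureChunkBuffer`, `toy_encoder`, and `max_frames` are hypothetical names, and the toy encoder merely stands in for a VLM backbone.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Feature = List[float]


class FeatureChunkBuffer:
    """Hypothetical FIFO cache of per-frame features (an assumption, not the paper's code)."""

    def __init__(self, encode: Callable[[Sequence[float]], Feature], max_frames: int) -> None:
        if max_frames < 1:
            raise ValueError("max_frames must be at least 1")
        self._encode = encode
        # deque(maxlen=...) silently drops the oldest feature once the buffer is full.
        self._features: Deque[Feature] = deque(maxlen=max_frames)
        self.encode_calls = 0

    def push(self, frame: Sequence[float]) -> List[Feature]:
        """Encode only the newest frame, cache it, and return the chunk (oldest first)."""
        self.encode_calls += 1
        self._features.append(self._encode(frame))
        return list(self._features)


def toy_encoder(frame: Sequence[float]) -> Feature:
    # Stand-in for the single-frame backbone: summarizes a frame as [mean, max].
    return [sum(frame) / len(frame), max(frame)]


buffer = FeatureChunkBuffer(toy_encoder, max_frames=3)
frames = ([1.0, 3.0], [2.0, 4.0], [5.0, 7.0], [0.0, 2.0])
chunks = [buffer.push(frame) for frame in frames]
```

Under this assumption the backbone runs once per incoming frame (four calls for four frames here), whereas re-encoding the full history at every step would run it once per frame in the window. The returned chunk would then be passed to whatever module predicts actions, which is outside the scope of this sketch.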

Published

2026-03-14

How to Cite

Li, H., Yang, S., Chen, Y., Chen, X., Yang, X., Tian, Y., … Pang, J. (2026). Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling. Proceedings of the AAAI Conference on Artificial Intelligence, 40(22), 18388–18396. https://doi.org/10.1609/aaai.v40i22.38903

Issue

Vol. 40 No. 22 (2026)

Section

AAAI Technical Track on Intelligent Robotics