MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model via Mixture-of-Layers for Efficient Robot Manipulation

Authors

  • Rongyu Zhang (The Hong Kong Polytechnic University; Nanjing University; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
  • Menghang Dong (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
  • Yuan Zhang (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
  • Liang Heng (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
  • Xiaowei Chi (Hong Kong University of Science and Technology)
  • Gaole Dai (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
  • Li Du (Nanjing University)
  • Dan Wang (Hong Kong University of Science and Technology)
  • Yuan Du (Nanjing University)
  • Shanghang Zhang (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)

DOI:

https://doi.org/10.1609/aaai.v40i22.38945

Abstract

Vision-Language-Action (VLA) models enable robotic systems to perform embodied tasks, but they are difficult to deploy because of the high computational demands of dense Large Language Models (LLMs). Existing early-exit-based sparsification methods often overlook the critical semantic role that the final layers play in downstream tasks. Drawing on the recent Shallow Brain Hypothesis (SBH) in neuroscience and on mixture-of-experts approaches to model sparsification, we treat each LLM layer as an expert and propose a Mixture-of-LayEr Vision Language Action model (MoLe-VLA, or simply MoLe) architecture for dynamic LLM layer activation. Specifically, we introduce a Spatial-Temporal Aware Router (STAR) that lets MoLe selectively activate only a subset of layers based on the robot's current state, mimicking the brain's distinct signal pathways specialized for cognition and causal reasoning. Additionally, to compensate for the cognitive ability the LLM loses through layer skipping, we devise Cognitive self-Knowledge Distillation (CogKD), which leverages cognition features to improve the model's understanding of task demands and its generation of task-relevant action sequences. Extensive experiments in RLBench simulations and real-world environments demonstrate the superiority of MoLe-VLA in both efficiency and performance: it improves the mean success rate by 9.7% across ten simulation tasks while accelerating inference by 36.8% over OpenVLA.
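
The idea of treating each layer as an expert and activating only some of them per input can be illustrated with a toy sketch. This is not the authors' STAR router or the MoLe-VLA implementation: it swaps in a plain linear top-k router, toy residual layers, and pure-Python vectors. All names here (`make_layer`, `router_scores`, `route_top_k`, `forward_with_skipping`) are hypothetical and chosen only for illustration.

```python
import math
import random


def make_layer(scale):
    """Return a toy residual layer: x -> x + scale * tanh(x)."""
    def layer(x):
        return [xi + scale * math.tanh(xi) for xi in x]
    return layer


def router_scores(x, router_weights):
    """Hypothetical linear router: one score per layer from the input features."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]


def route_top_k(scores, k):
    """Return the indices of the k highest-scoring layers, in layer order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def forward_with_skipping(x, layers, router_weights, k):
    """Run only the k layers chosen by the router; skipped layers act as identity."""
    active = route_top_k(router_scores(x, router_weights), k)
    active_set = set(active)
    for i, layer in enumerate(layers):
        if i in active_set:
            x = layer(x)
    return x, active


# Toy setup (assumed sizes): 8 layers, 4 input features, keep 4 layers per input.
rng = random.Random(0)
num_layers, dim, k = 8, 4, 4
layers = [make_layer(0.1 * (i + 1)) for i in range(num_layers)]
router_weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_layers)]

state = [0.5, -0.2, 0.1, 0.9]
output, active = forward_with_skipping(state, layers, router_weights, k)
print("active layers:", active)
```

Unlike early exit, which always drops the deepest layers, a per-input selection like this can keep the final layers active whenever the router scores them highly, which is the property the abstract emphasizes.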

Published

2026-03-14

How to Cite

Zhang, R., Dong, M., Zhang, Y., Heng, L., Chi, X., Dai, G., … Zhang, S. (2026). MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model via Mixture-of-Layers for Efficient Robot Manipulation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(22), 18764–18772. https://doi.org/10.1609/aaai.v40i22.38945

Section

AAAI Technical Track on Intelligent Robotics