MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model via Mixture-of-Layers for Efficient Robot Manipulation

Rongyu Zhang; Menghang Dong; Yuan Zhang; Liang Heng; Xiaowei Chi; Gaole Dai; Li Du; Dan Wang; Yuan Du; Shanghang Zhang

doi:10.1609/aaai.v40i22.38945

Authors

Rongyu Zhang: The Hong Kong Polytechnic University; Nanjing University; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Menghang Dong: State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Yuan Zhang: State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Liang Heng: State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Xiaowei Chi: Hong Kong University of Science and Technology
Gaole Dai: State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Li Du: Nanjing University
Dan Wang: Hong Kong University of Science and Technology
Yuan Du: Nanjing University
Shanghang Zhang: State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University


Abstract

Vision-Language-Action (VLA) models enable robotic systems to perform embodied tasks, but they are difficult to deploy because of the high computational demands of dense Large Language Models (LLMs). Existing early-exit-based sparsification methods often overlook the critical semantic role that the final layers play in downstream tasks. Aligning with the recent Shallow Brain Hypothesis (SBH) in neuroscience and with mixture-of-experts approaches to model sparsification, we treat each LLM layer as an expert and propose a Mixture-of-LayEr Vision Language Action model (MoLe-VLA, or simply MoLe) architecture for dynamic LLM layer activation. Specifically, we introduce a Spatial-Temporal Aware Router (STAR) that lets MoLe activate only a subset of layers based on the robot's current state, mimicking the brain's distinct signal pathways specialized for cognition and causal reasoning. To compensate for the cognitive ability that the LLM loses through layer skipping, we also devise Cognitive self-Knowledge Distillation (CogKD), which leverages cognition features to improve the model's understanding of task demands and its generation of task-relevant action sequences. Extensive experiments in RLBench simulations and real-world environments demonstrate the superiority of MoLe-VLA in both efficiency and performance: it improves the mean success rate by 9.7% across ten simulation tasks while accelerating inference by 36.8% over OpenVLA.
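
The idea of treating each layer as an expert and activating only some of them can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not describe STAR's internals, so the linear scorer, the top-k selection rule, and the names `make_toy_layer`, `route_top_k`, and `forward_with_skipping` are all assumptions made for illustration, and the toy residual layers merely stand in for real LLM layers.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_toy_layer(weight: float) -> Layer:
    """Hypothetical stand-in for one LLM layer: a residual tanh update."""
    def layer(h: Vector) -> Vector:
        return [x + math.tanh(weight * x) for x in h]
    return layer


def route_top_k(state: Vector, router_weights: Sequence[Vector], k: int) -> List[int]:
    """Score every layer from the current state and keep the k best.

    Assumption: a linear scorer with top-k selection. The real router
    (STAR) is not specified in the abstract.
    """
    scores = [sum(w * s for w, s in zip(weights, state)) for weights in router_weights]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # restore the original layer order


def forward_with_skipping(h: Vector, layers: Sequence[Layer], active: Sequence[int]) -> Vector:
    """Apply only the active layers; a skipped layer acts as the identity."""
    active_set = set(active)
    for index, layer in enumerate(layers):
        if index in active_set:
            h = layer(h)
    return h


# Four toy layers and one hypothetical router weight vector per layer.
layers = [make_toy_layer(0.1 * (i + 1)) for i in range(4)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]

state = [2.0, 1.0]                       # stand-in for the robot's current state
active = route_top_k(state, router_weights, k=2)
output = forward_with_skipping([0.5, -0.5], layers, active)
```

With these made-up weights the router scores the four layers 2.0, 1.0, -2.0, and 1.5, so only layers 0 and 3 run and the other two are skipped. Unlike early exit, this kind of selection can keep the final layer active while dropping intermediate ones.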
