DiTEA: Mixture-of-Experts for Vision-Language-Action Model in Robotic Manipulation

Chengxuan Li; Xingwan Wang

doi:10.1609/aaai.v40i22.38902

Authors

Chengxuan Li: Institute of Automation, Chinese Academy of Science, Beijing, China; School of Advanced Manufacturing and Robotics, Peking University, Beijing, China
Xingwan Wang: University of Science and Technology of China, Hefei, China

DOI: https://doi.org/10.1609/aaai.v40i22.38902

Abstract

After large-scale pre-training and post-training, current diffusion-based Vision-Language-Action (VLA) models offer faster inference than traditional autoregressive models and can handle the action multi-modality problem in robot manipulation tasks. However, diffusion-based VLA models have been found to follow instructions poorly, and after fine-tuning on multiple tasks they often suffer from "skill forgetting" caused by conflicts between the model weights each task requires. To address these problems, we propose DiTEA, a Diffusion Transformer-based Mixture-of-Experts (MoE) VLA model. Specifically, DiTEA integrates an MoE module into the action head of the VLA to form the Action MoE. In addition, we design the Task-Instruction Gate, which uses the language instruction to select the experts that specialize in the given task, thereby improving the VLA's instruction-following ability. We conduct comprehensive experiments and ablation studies to evaluate the efficacy of our model under different designs. Results from simulation and real-world experiments show that DiTEA substantially improves multi-task performance over the baseline and other VLAs.
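The abstract does not specify how the Task-Instruction Gate or the Action MoE are implemented. The following is only a minimal sketch of the general idea, under several assumptions that are not taken from the paper: the gate is a single linear layer over an instruction embedding, routing is top-k with a softmax over the selected experts, and each expert is a plain linear map. All names (`top_k_gate`, `action_moe`, `W_gate`) are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def top_k_gate(instr_emb, W_gate, k):
    """Pick k experts from the instruction embedding alone (assumed design).

    Returns {expert_index: weight}; the weights sum to 1.
    """
    logits = matvec(W_gate, instr_emb)
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return dict(zip(top, weights))


def action_moe(action_token, instr_emb, W_gate, experts, k=2):
    """Mix the outputs of the selected experts for one action token.

    Each expert is a square matrix here; a real action head would
    likely use small MLPs inside a Diffusion Transformer block.
    """
    gate = top_k_gate(instr_emb, W_gate, k)
    out = [0.0] * len(action_token)
    for idx, weight in gate.items():
        y = matvec(experts[idx], action_token)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, gate


# Toy usage with random weights (illustrative sizes only).
rng = random.Random(0)
D_INSTR, D_ACT, N_EXPERTS = 4, 3, 4
W_gate = [[rng.gauss(0, 1) for _ in range(D_INSTR)] for _ in range(N_EXPERTS)]
experts = [
    [[rng.gauss(0, 1) for _ in range(D_ACT)] for _ in range(D_ACT)]
    for _ in range(N_EXPERTS)
]
instruction = [rng.gauss(0, 1) for _ in range(D_INSTR)]
noisy_action = [rng.gauss(0, 1) for _ in range(D_ACT)]
mixed, chosen = action_moe(noisy_action, instruction, W_gate, experts, k=2)
print("selected experts:", sorted(chosen))
```

In this sketch the routing depends only on the instruction, not on the action token, so every denoising step for a given task is sent to the same experts. That is one plausible way to keep per-task weights separate and reduce the cross-task interference the abstract calls "skill forgetting".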
