DiTEA: Mixture-of-Experts for Vision-Language-Action Model in Robotic Manipulation

Authors

  • Chengxuan Li, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Advanced Manufacturing and Robotics, Peking University, Beijing, China
  • Xingwan Wang, University of Science and Technology of China, Hefei, China

DOI:

https://doi.org/10.1609/aaai.v40i22.38902

Abstract

After large-scale pre-training and post-training, diffusion-based Vision-Language-Action (VLA) models offer faster inference than traditional autoregressive models and can handle the action multi-modality problem in robot manipulation tasks. However, diffusion-based VLA models have been found to follow instructions poorly, and after fine-tuning on multiple tasks they often suffer from "skill forgetting" caused by conflicting model weights across tasks. To address these problems, we propose DiTEA, a Diffusion Transformer-based Mixture-of-Experts (MoE) VLA model. Specifically, it integrates an MoE module into the action head of the VLA to form the Action MoE. In addition, we design the Task-Instruction Gate, which uses language instructions to select the experts specialized for each task, improving the VLA's instruction-following ability. We conducted comprehensive experiments and ablation studies to evaluate the efficacy of our model under different designs. Experimental results in simulation and in the real world show that DiTEA substantially improves multi-task performance over the baseline and other VLAs.
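The routing idea in the abstract, a gate driven by the language instruction that picks experts inside the action head, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the gate's form, the number of experts, or how expert outputs are combined. The class name `InstructionGatedMoE`, the linear gate, the top-k selection with renormalized weights, and the single-matrix "experts" are all assumptions made here for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class InstructionGatedMoE:
    """Hypothetical MoE layer whose routing depends only on the instruction.

    Assumptions (not from the paper): a linear gate, top-k routing,
    and experts that are plain linear maps over an action token.
    """

    def __init__(self, num_experts, instr_dim, token_dim, top_k=2, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.top_k = top_k
        self.gate = rand_matrix(num_experts, instr_dim)
        self.experts = [rand_matrix(token_dim, token_dim) for _ in range(num_experts)]

    def route(self, instr_emb):
        """Return {expert_index: weight} for the top-k experts."""
        probs = softmax(matvec(self.gate, instr_emb))
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[: self.top_k]
        total = sum(probs[i] for i in top)
        # Renormalize so the selected experts' weights sum to 1.
        return {i: probs[i] / total for i in top}

    def forward(self, action_token, instr_emb):
        """Mix the selected experts' outputs for one action token."""
        weights = self.route(instr_emb)
        out = [0.0] * len(action_token)
        for i, w in weights.items():
            y = matvec(self.experts[i], action_token)
            out = [o + w * yi for o, yi in zip(out, y)]
        return out


moe = InstructionGatedMoE(num_experts=4, instr_dim=3, token_dim=5, top_k=2)
instruction = [1.0, 0.0, -1.0]  # stand-in for a language-instruction embedding
weights = moe.route(instruction)
output = moe.forward([0.1] * 5, instruction)
```

Because the gate in this sketch sees only the instruction embedding, every action token produced under the same instruction is routed to the same experts. That is one plausible reading of "uses language instructions to select specific experts"; the paper itself may condition the gate on additional inputs.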

Published

2026-03-14

How to Cite

Li, C., & Wang, X. (2026). DiTEA: Mixture-of-Experts for Vision-Language-Action Model in Robotic Manipulation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(22), 18379–18387. https://doi.org/10.1609/aaai.v40i22.38902

Issue

Section

AAAI Technical Track on Intelligent Robotics