MoDE: A Mixture-of-Experts Model with Mutual Distillation among the Experts
DOI:
https://doi.org/10.1609/aaai.v38i14.29539
Keywords:
ML: Deep Learning Algorithms
Abstract
The application of mixture-of-experts (MoE) is gaining popularity due to its ability to improve a model's performance. In an MoE structure, the gate layer plays a significant role in distinguishing and routing input features to different experts, which enables each expert to specialize in processing its corresponding sub-task. However, the gate's routing mechanism also gives rise to "narrow vision": each individual expert learns its allocated sub-task from only the samples routed to it, which in turn limits the MoE from further improving its generalization ability. To address this effectively, we propose a method called Mixture-of-Distilled-Expert (MoDE), which applies moderate mutual distillation among the experts, enabling each expert to pick up features learned by the other experts and gain a more accurate perception of its allocated sub-task. We conduct extensive experiments on tabular, NLP, and CV datasets, which demonstrate MoDE's effectiveness, universality, and robustness. Furthermore, we develop a parallel study by constructing a novel "expert probing" technique to experimentally show why MoDE works: moderately distilling knowledge from other experts improves each individual expert's test performance on its assigned task, leading to an overall improvement in the MoE's performance.
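The sketch below is a minimal illustration of the idea described in the abstract, not the authors' implementation: an MoE layer with a softmax gate whose experts are additionally regularized by a mutual-distillation term. The class name MoDESketch, the use of linear experts, the MSE-based distillation between expert outputs, and the weighting factor alpha are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoDESketch(nn.Module):
    """Gated mixture-of-experts with a mutual-distillation regularizer (illustrative sketch)."""

    def __init__(self, d_in, d_out, n_experts=4, alpha=0.1):
        super().__init__()
        # Experts and gate are simple linear layers here; the paper's experts may differ.
        self.experts = nn.ModuleList([nn.Linear(d_in, d_out) for _ in range(n_experts)])
        self.gate = nn.Linear(d_in, n_experts)
        self.alpha = alpha  # "moderate" distillation strength (hypothetical value)

    def forward(self, x):
        w = F.softmax(self.gate(x), dim=-1)                       # (B, E) routing weights
        outs = torch.stack([e(x) for e in self.experts], dim=1)   # (B, E, d_out) expert outputs
        y = (w.unsqueeze(-1) * outs).sum(dim=1)                   # gated mixture output

        # Mutual distillation: pull each expert's output toward every other expert's
        # (detached) output, so an expert also benefits from samples routed elsewhere.
        n = len(self.experts)
        distill = x.new_zeros(())
        for i in range(n):
            for j in range(n):
                if i != j:
                    distill = distill + F.mse_loss(outs[:, i], outs[:, j].detach())
        distill = distill / (n * (n - 1))
        return y, self.alpha * distill
```

In a training loop one would add the returned distillation term to the ordinary task loss; keeping alpha small corresponds to the "moderate" distillation the abstract argues for, so experts share knowledge without collapsing into identical models.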
Published
2024-03-24
How to Cite
Xie, Z., Zhang, Y., Zhuang, C., Shi, Q., Liu, Z., Gu, J., & Zhang, G. (2024). MoDE: A Mixture-of-Experts Model with Mutual Distillation among the Experts. Proceedings of the AAAI Conference on Artificial Intelligence, 38(14), 16067-16075. https://doi.org/10.1609/aaai.v38i14.29539
Issue
Section
AAAI Technical Track on Machine Learning V