FIRM-MoE: Fine-Grained Expert Decomposition for Resource-Adaptive MoE Inference
DOI: https://doi.org/10.1609/aaai.v40i24.39106
Abstract
Mixture-of-Experts (MoE) is a sparse neural architecture that significantly increases model capacity while maintaining low computational complexity. However, deploying MoE-based large language models (LLMs) on memory-constrained edge devices remains challenging due to their substantial memory requirements. To address this issue, we propose FIRM-MoE, a fine-grained expert offloading framework designed to enable flexible and efficient MoE inference. The core insight of our approach is to reduce the risk of inaccurate expert loading by decomposing each expert into fine-grained sub-experts and then dynamically allocating them through a fine-grained scheduling strategy. To further reduce expert-loading errors, we introduce a multi-layer expert prediction mechanism and a resource-adaptive expert pre-loading algorithm that enable more robust expert allocation. This design allows our model to achieve more efficient expert utilization and improved resilience to prediction errors. We conduct extensive experiments to demonstrate the superiority of FIRM-MoE across diverse memory constraints. The results show that FIRM-MoE achieves up to 1.5× speedup and 2.8× memory savings in decoding, compared to state-of-the-art MoE offloading strategies.
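To give a rough intuition for the decomposition idea described above, the sketch below shows one plausible way to split a two-layer FFN expert into sub-experts along its hidden dimension, so that any subset of slices fits in memory and summing all slices reproduces the full expert exactly. This is a minimal illustration, not the authors' implementation: the names (SubExpert, decompose_expert) are hypothetical, a plain ReLU FFN stands in for the gated experts used in real MoE LLMs, and the paper's scheduling, prediction, and pre-loading machinery is not shown.

```python
# Hypothetical sketch of fine-grained expert decomposition (not the paper's code).
import torch
import torch.nn as nn


class SubExpert(nn.Module):
    """One slice of an expert: columns [s:e] of W_up and rows [s:e] of W_down."""

    def __init__(self, w_up_slice: torch.Tensor, w_down_slice: torch.Tensor):
        super().__init__()
        self.w_up = nn.Parameter(w_up_slice)       # (d_model, slice_width)
        self.w_down = nn.Parameter(w_down_slice)   # (slice_width, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU is elementwise, so slicing the hidden dimension commutes with
        # the activation; the partial outputs of all slices sum to the full FFN.
        return torch.relu(x @ self.w_up) @ self.w_down


def decompose_expert(w_up: torch.Tensor, w_down: torch.Tensor, num_slices: int):
    """Split one expert (w_up: d_model x d_ff, w_down: d_ff x d_model)
    into `num_slices` sub-experts along the d_ff dimension."""
    d_ff = w_up.shape[1]
    assert d_ff % num_slices == 0, "assume d_ff divisible for simplicity"
    width = d_ff // num_slices
    return [
        SubExpert(w_up[:, i * width:(i + 1) * width].clone(),
                  w_down[i * width:(i + 1) * width, :].clone())
        for i in range(num_slices)
    ]


if __name__ == "__main__":
    d_model, d_ff = 512, 2048
    w_up, w_down = torch.randn(d_model, d_ff), torch.randn(d_ff, d_model)
    subs = decompose_expert(w_up, w_down, num_slices=4)

    x = torch.randn(1, d_model)
    full = torch.relu(x @ w_up) @ w_down
    exact = sum(sub(x) for sub in subs)        # all slices: exact reconstruction
    partial = sum(sub(x) for sub in subs[:2])  # load only what memory allows
    assert torch.allclose(full, exact, rtol=1e-4, atol=1e-3)
```

Under this framing, a mispredicted expert costs only the slices actually loaded rather than the whole expert, which is one way the decomposition could improve resilience to prediction errors.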
Published
2026-03-14
How to Cite
Chen, K., Zhou, Q., Qian, B., Wen, Z., Meng, W., & He, S. (2026). FIRM-MoE: Fine-Grained Expert Decomposition for Resource-Adaptive MoE Inference. Proceedings of the AAAI Conference on Artificial Intelligence, 40(24), 20190-20198. https://doi.org/10.1609/aaai.v40i24.39106
Section
AAAI Technical Track on Machine Learning I