FIRM-MoE: Fine-Grained Expert Decomposition for Resource-Adaptive MoE Inference
DOI: https://doi.org/10.1609/aaai.v40i24.39106
Abstract
Mixture-of-Experts (MoE) is a sparse neural architecture that significantly increases model capacity while maintaining low computational complexity. However, deploying MoE-based large language models (LLMs) on memory-constrained edge devices remains challenging due to their substantial memory requirements. To address this issue, we propose FIRM-MoE, a fine-grained expert offloading framework designed to enable flexible and efficient MoE inference. The core insight of our approach is to reduce the risk of inaccurate expert loading by decomposing each expert into fine-grained sub-experts and then dynamically allocating them through a fine-grained scheduling strategy. To further reduce expert-loading errors, we introduce a multi-layer expert prediction mechanism and a resource-adaptive expert pre-loading algorithm that enable more robust expert allocation. This design allows our model to achieve more efficient expert utilization and improved resilience to prediction errors. We conduct extensive experiments to demonstrate the superiority of FIRM-MoE across diverse memory constraints. The results show that FIRM-MoE achieves up to 1.5× speedup and 2.8× memory savings in decoding, compared to state-of-the-art MoE offloading strategies.
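To give a rough intuition for the decomposition idea described above, the sketch below shows one plausible way to split a two-layer FFN expert into sub-experts along its hidden dimension, so that any subset of slices fits in memory and summing all slices reproduces the full expert exactly. This is a minimal illustration, not the authors' implementation: the names (SubExpert, decompose_expert) are hypothetical, a plain ReLU FFN stands in for the gated experts used in real MoE LLMs, and the paper's scheduling, prediction, and pre-loading machinery is not shown.

```python
# Hypothetical sketch of fine-grained expert decomposition (not the paper's code).
import torch
import torch.nn as nn


class SubExpert(nn.Module):
    """One slice of an expert: columns [s:e] of W_up and rows [s:e] of W_down."""

    def __init__(self, w_up_slice: torch.Tensor, w_down_slice: torch.Tensor):
        super().__init__()
        self.w_up = nn.Parameter(w_up_slice)       # (d_model, slice_width)
        self.w_down = nn.Parameter(w_down_slice)   # (slice_width, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU is elementwise, so slicing the hidden dimension commutes with
        # the activation; the partial outputs of all slices sum to the full FFN.
        return torch.relu(x @ self.w_up) @ self.w_down


def decompose_expert(w_up: torch.Tensor, w_down: torch.Tensor, num_slices: int):
    """Split one expert (w_up: d_model x d_ff, w_down: d_ff x d_model)
    into `num_slices` sub-experts along the d_ff dimension."""
    d_ff = w_up.shape[1]
    assert d_ff % num_slices == 0, "assume d_ff divisible for simplicity"
    width = d_ff // num_slices
    return [
        SubExpert(w_up[:, i * width:(i + 1) * width].clone(),
                  w_down[i * width:(i + 1) * width, :].clone())
        for i in range(num_slices)
    ]


if __name__ == "__main__":
    d_model, d_ff = 512, 2048
    w_up, w_down = torch.randn(d_model, d_ff), torch.randn(d_ff, d_model)
    subs = decompose_expert(w_up, w_down, num_slices=4)

    x = torch.randn(1, d_model)
    full = torch.relu(x @ w_up) @ w_down
    exact = sum(sub(x) for sub in subs)        # all slices: exact reconstruction
    partial = sum(sub(x) for sub in subs[:2])  # load only what memory allows
    assert torch.allclose(full, exact, rtol=1e-4, atol=1e-3)
```

Under this framing, a mispredicted expert costs only the slices actually loaded rather than the whole expert, which is one way the decomposition could improve resilience to prediction errors.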
Published
2026-03-14
How to Cite
Chen, K., Zhou, Q., Qian, B., Wen, Z., Meng, W., & He, S. (2026). FIRM-MoE: Fine-Grained Expert Decomposition for Resource-Adaptive MoE Inference. Proceedings of the AAAI Conference on Artificial Intelligence, 40(24), 20190-20198. https://doi.org/10.1609/aaai.v40i24.39106
Section
AAAI Technical Track on Machine Learning I