CommitMoE: Efficient Fallback-Free MoE Inference with Offloading Under GPU Memory Constraints

Authors

  • Han Li, University of Science and Technology of China
  • Jingwei Sun, University of Science and Technology of China
  • Junqing Lin, University of Science and Technology of China
  • Guangzhong Sun, University of Science and Technology of China

DOI:

https://doi.org/10.1609/aaai.v40i27.39454

Abstract

Mixture of Experts (MoE) models have emerged as a promising approach to scale language models efficiently by activating only a subset of parameters for each input. However, deploying these models under GPU memory constraints remains challenging, as existing offloading strategies incur significant overhead from CPU-GPU data transfers. While prior work has explored prefetching techniques to mitigate this bottleneck, these methods require costly fallback mechanisms when predictions fail. Because expert transfers cannot be canceled once initiated, the correct experts must then be loaded on demand sequentially, introducing additional latency. To address this, we present CommitMoE, a novel approach featuring a Commit Router that makes execution decisions based on expert predictions without any fallback mechanism. Our key insight is that router certainty strongly correlates with prediction accuracy, while in low-certainty cases the model output is inherently robust to the choice of experts. Leveraging this insight in a systems-level design, CommitMoE achieves 1.3× to 9.4× faster inference than state-of-the-art offloading frameworks across different environments and datasets while maintaining model quality.
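
The abstract does not spell out the Commit Router's decision rule, so the sketch below is only one plausible reading of a fallback-free routing policy: it assumes top-k probability mass as the certainty measure, a hypothetical threshold tau, and hypothetical prefetched/resident expert sets supplied by the offloading runtime. None of these specifics come from the paper; the sketch simply illustrates the idea that high certainty lets the model commit to predicted (prefetched) experts, while low certainty lets it substitute already-available experts instead of stalling on an on-demand transfer.

    import numpy as np

    def softmax(logits):
        z = logits - logits.max()
        e = np.exp(z)
        return e / e.sum()

    def commit_route(router_logits, prefetched, resident, top_k=2, tau=0.6):
        """Illustrative commit-style routing for a single token (not the paper's code).

        router_logits : gating scores over all experts (1-D array)
        prefetched    : set of expert ids already being copied to the GPU
        resident      : set of expert ids currently held in GPU memory
        tau           : hypothetical certainty threshold

        The policy never issues an on-demand (fallback) expert transfer: it
        either commits to the prefetched experts or substitutes experts that
        are already available on the GPU.
        """
        probs = softmax(np.asarray(router_logits, dtype=np.float64))
        ranked = np.argsort(probs)[::-1]          # experts, most to least probable
        top = ranked[:top_k]
        certainty = float(probs[top].sum())       # top-k probability mass as a certainty proxy

        if certainty >= tau and all(int(e) in prefetched for e in top):
            # High certainty: the prediction matches the router, so commit to it.
            return [int(e) for e in top]

        # Low certainty (or prediction miss): treat the output as robust to the
        # expert choice and run on the best experts that need no new transfer.
        available = prefetched | resident
        return [int(e) for e in ranked if int(e) in available][:top_k]

For example, with logits of [2.0, 1.9, 0.1, 0.0], prefetched experts {0, 1}, and resident experts {2}, the top-2 mass is high and the call commits to experts [0, 1]; with near-uniform logits it would instead pick the highest-scoring experts among {0, 1, 2}, never triggering a sequential on-demand load.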

Published

2026-03-14

How to Cite

Li, H., Sun, J., Lin, J., & Sun, G. (2026). CommitMoE: Efficient Fallback-Free MoE Inference with Offloading Under GPU Memory Constraints. Proceedings of the AAAI Conference on Artificial Intelligence, 40(27), 22904-22912. https://doi.org/10.1609/aaai.v40i27.39454

Section

AAAI Technical Track on Machine Learning IV