CommitMoE: Efficient Fallback-Free MoE Inference with Offloading Under GPU Memory Constraints

Authors

  • Han Li, University of Science and Technology of China
  • Jingwei Sun, University of Science and Technology of China
  • Junqing Lin, University of Science and Technology of China
  • Guangzhong Sun, University of Science and Technology of China

DOI:

https://doi.org/10.1609/aaai.v40i27.39454

Abstract

Mixture of Experts (MoE) models have emerged as a promising approach to scale language models efficiently by activating only a subset of parameters for each input. However, deploying these models under GPU memory constraints remains challenging, as existing offloading strategies incur significant overhead from CPU-GPU data transfers. While prior work has explored prefetching techniques to mitigate this bottleneck, these methods require costly fallback mechanisms when predictions fail. Because expert transfers cannot be canceled once initiated, the correct experts must then be loaded on demand sequentially, introducing additional latency. To address this, we present CommitMoE, a novel approach featuring a Commit Router that makes execution decisions based on expert predictions without any fallback mechanism. Our key insight is that router certainty strongly correlates with prediction accuracy, while in low-certainty cases the model output is inherently robust to the choice of experts. Leveraging this insight in a systems-level design, CommitMoE achieves 1.3× to 9.4× faster inference than state-of-the-art offloading frameworks across different environments and datasets while maintaining model quality.
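
The abstract does not spell out the Commit Router's decision rule, so the sketch below is only one plausible reading of a fallback-free routing policy: it assumes top-k probability mass as the certainty measure, a hypothetical threshold tau, and hypothetical prefetched/resident expert sets supplied by the offloading runtime. None of these specifics come from the paper; the sketch simply illustrates the idea that high certainty lets the model commit to predicted (prefetched) experts, while low certainty lets it substitute already-available experts instead of stalling on an on-demand transfer.

    import numpy as np

    def softmax(logits):
        z = logits - logits.max()
        e = np.exp(z)
        return e / e.sum()

    def commit_route(router_logits, prefetched, resident, top_k=2, tau=0.6):
        """Illustrative commit-style routing for a single token (not the paper's code).

        router_logits : gating scores over all experts (1-D array)
        prefetched    : set of expert ids already being copied to the GPU
        resident      : set of expert ids currently held in GPU memory
        tau           : hypothetical certainty threshold

        The policy never issues an on-demand (fallback) expert transfer: it
        either commits to the prefetched experts or substitutes experts that
        are already available on the GPU.
        """
        probs = softmax(np.asarray(router_logits, dtype=np.float64))
        ranked = np.argsort(probs)[::-1]          # experts, most to least probable
        top = ranked[:top_k]
        certainty = float(probs[top].sum())       # top-k probability mass as a certainty proxy

        if certainty >= tau and all(int(e) in prefetched for e in top):
            # High certainty: the prediction matches the router, so commit to it.
            return [int(e) for e in top]

        # Low certainty (or prediction miss): treat the output as robust to the
        # expert choice and run on the best experts that need no new transfer.
        available = prefetched | resident
        return [int(e) for e in ranked if int(e) in available][:top_k]

For example, with logits of [2.0, 1.9, 0.1, 0.0], prefetched experts {0, 1}, and resident experts {2}, the top-2 mass is high and the call commits to experts [0, 1]; with near-uniform logits it would instead pick the highest-scoring experts among {0, 1, 2}, never triggering a sequential on-demand load.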

Published

2026-03-14

How to Cite

Li, H., Sun, J., Lin, J., & Sun, G. (2026). CommitMoE: Efficient Fallback-Free MoE Inference with Offloading Under GPU Memory Constraints. Proceedings of the AAAI Conference on Artificial Intelligence, 40(27), 22904-22912. https://doi.org/10.1609/aaai.v40i27.39454

Section

AAAI Technical Track on Machine Learning IV