CasMoE: A Cascaded Framework for Efficient MoE Inference on Resource-constrained Devices

Authors

  • Chengcheng Wang University of Electronic Science and Technology of China, Chengdu, China; Shenzhen Institute for Advanced Study, UESTC, Shenzhen, China
  • Haowen He University of Electronic Science and Technology of China, Chengdu, China; Shenzhen Institute for Advanced Study, UESTC, Shenzhen, China
  • Liang Zhao School of Computer Science, Shenyang Aerospace University, Shenyang, China
  • Xiaoheng Deng School of Computer Science and Engineering, Central South University, Changsha, China
  • Lixin Duan University of Electronic Science and Technology of China, Chengdu, China; Shenzhen Institute for Advanced Study, UESTC, Shenzhen, China
  • Shaohua Wan University of Electronic Science and Technology of China, Chengdu, China; Shenzhen Institute for Advanced Study, UESTC, Shenzhen, China

DOI:

https://doi.org/10.1609/aaai.v40i31.39816

Abstract

The Mixture-of-Experts (MoE) architecture has emerged as a key enabler for scaling large language models (LLMs), empowering increased model capacity with minimal computational overhead through gating-based dynamic expert activation. However, due to the memory demands introduced by expert modules, MoE inference on resource-constrained devices is still challenging. Existing methods such as model compression and parameter offloading provide partial alleviation but often lead to reduced accuracy or increased latency. In this paper, we propose CasMoE, a general and efficient cascaded framework for accelerating MoE inference on resource-constrained devices. CasMoE employs a two-stage offline-online approach to facilitate efficient expert prefetching. In the offline stage, a parameterized Expert Activation Predictor (EAP) is introduced to accurately predict the corresponding expert activation from the incoming prompt. In the online stage, a non-parametric Expert Activation Matcher (EAM) supporting fast expert retrieval is then integrated with the EAP to form a cascade planner that operates independently of the MoE architecture, predicting activated experts for all MoE layers in a single pass prior to decoding. A gating mechanism is also incorporated to dynamically adjust the sensitivity of the EAM and EAP, enabling a flexible trade-off between inference efficiency and quality. Extensive experiments on diverse downstream tasks demonstrate CasMoE’s effectiveness in accelerating inference while preserving high accuracy.
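
The cascade described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the EAM retrieves experts or how the gating mechanism is defined, so the sketch assumes a cosine-similarity lookup over stored prompt embeddings and a single threshold that decides when to fall back from the matcher to the predictor. The names `Matcher`, `cascade_plan`, `toy_predictor`, and `match_threshold` are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# A plan maps each MoE layer index to the expert ids to prefetch.
Plan = Dict[int, List[int]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class Matcher:
    """Hypothetical stand-in for the EAM: nearest stored prompt embedding."""

    def __init__(self) -> None:
        self.entries: List[Tuple[List[float], Plan]] = []

    def add(self, embedding: Sequence[float], plan: Plan) -> None:
        self.entries.append((list(embedding), plan))

    def lookup(self, embedding: Sequence[float]) -> Tuple[float, Plan]:
        """Return (best similarity, its plan); (-1.0, {}) if empty."""
        best_score, best_plan = -1.0, {}
        for stored, plan in self.entries:
            score = cosine(embedding, stored)
            if score > best_score:
                best_score, best_plan = score, plan
        return best_score, best_plan


def cascade_plan(
    embedding: Sequence[float],
    matcher: Matcher,
    predictor: Callable[[Sequence[float]], Plan],
    match_threshold: float,
) -> Tuple[str, Plan]:
    """Try the cheap matcher first; fall back to the predictor.

    Assumption: the gate is a similarity threshold. Raising it sends
    more prompts to the predictor; lowering it favours the matcher.
    """
    score, plan = matcher.lookup(embedding)
    if score >= match_threshold:
        return "matcher", plan
    return "predictor", predictor(embedding)


def toy_predictor(embedding: Sequence[float]) -> Plan:
    # Stand-in for a learned EAP: returns a fixed plan for two layers.
    return {0: [0, 1], 1: [2, 3]}


matcher = Matcher()
matcher.add([1.0, 0.0], {0: [1, 3], 1: [0, 2]})

near = cascade_plan([0.9, 0.1], matcher, toy_predictor, 0.9)
far = cascade_plan([0.0, 1.0], matcher, toy_predictor, 0.9)
```

In this toy run, `near` is served by the matcher because the query is close to a stored prompt, while `far` falls through to the predictor. In both cases the plan covers every MoE layer in one call, which mirrors the single-pass planning before decoding that the abstract describes.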

Published

2026-03-14

How to Cite

Wang, C., He, H., Zhao, L., Deng, X., Duan, L., & Wan, S. (2026). CasMoE: A Cascaded Framework for Efficient MoE Inference on Resource-constrained Devices. Proceedings of the AAAI Conference on Artificial Intelligence, 40(31), 26133–26141. https://doi.org/10.1609/aaai.v40i31.39816

Section

AAAI Technical Track on Machine Learning VIII