CasMoE: A Cascaded Framework for Efficient MoE Inference on Resource-constrained Devices

Chengcheng Wang; Haowen He; Liang Zhao; Xiaoheng Deng; Lixin Duan; Shaohua Wan

doi:10.1609/aaai.v40i31.39816

Authors

Chengcheng Wang: University of Electronic Science and Technology of China, Chengdu, China; Shenzhen Institute for Advanced Study, UESTC, Shenzhen, China
Haowen He: University of Electronic Science and Technology of China, Chengdu, China; Shenzhen Institute for Advanced Study, UESTC, Shenzhen, China
Liang Zhao: School of Computer Science, Shenyang Aerospace University, Shenyang, China
Xiaoheng Deng: School of Computer Science and Engineering, Central South University, Changsha, China
Lixin Duan: University of Electronic Science and Technology of China, Chengdu, China; Shenzhen Institute for Advanced Study, UESTC, Shenzhen, China
Shaohua Wan: University of Electronic Science and Technology of China, Chengdu, China; Shenzhen Institute for Advanced Study, UESTC, Shenzhen, China

Abstract

The Mixture-of-Experts (MoE) architecture has emerged as a key enabler for scaling large language models (LLMs), empowering increased model capacity with minimal computational overhead through gating-based dynamic expert activation. However, due to the memory demands introduced by expert modules, MoE inference on resource-constrained devices is still challenging. Existing methods such as model compression and parameter offloading provide partial alleviation but often lead to reduced accuracy or increased latency. In this paper, we propose CasMoE, a general and efficient cascaded framework for accelerating MoE inference on resource-constrained devices. CasMoE employs a two-stage offline-online approach to facilitate efficient expert prefetching. In the offline stage, a parameterized Expert Activation Predictor (EAP) is introduced to accurately predict the corresponding expert activation from the incoming prompt. In the online stage, a non-parametric Expert Activation Matcher (EAM) supporting fast expert retrieval is then integrated with the EAP to form a cascade planner that operates independently of the MoE architecture, predicting activated experts for all MoE layers in a single pass prior to decoding. A gating mechanism is also incorporated to dynamically adjust the sensitivity of the EAM and EAP, enabling a flexible trade-off between inference efficiency and quality. Extensive experiments on diverse downstream tasks demonstrate CasMoE’s effectiveness in accelerating inference while preserving high accuracy.
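
The cascade idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the EAM retrieves experts, what the EAP looks like, or how the gate is computed. The sketch assumes that the matcher is a cosine-similarity lookup over stored prompt embeddings, that the matcher is tried first with the predictor as fallback, and that the gate is a single similarity threshold. The names `CascadePlanner`, `remember`, `plan`, and `toy_predictor` are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# A prefetch plan maps each MoE layer index to the expert ids to load.
Plan = Dict[int, List[int]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CascadePlanner:
    """Hypothetical cascade: non-parametric matcher first, predictor as fallback."""

    def __init__(self, predictor: Callable[[Sequence[float]], Plan],
                 threshold: float = 0.9) -> None:
        self.predictor = predictor   # stands in for the parameterized EAP
        self.threshold = threshold   # assumed gate: higher = trust the matcher less
        self.memory: List[Tuple[List[float], Plan]] = []  # stands in for the EAM store

    def remember(self, embedding: Sequence[float], plan: Plan) -> None:
        """Store an observed (prompt embedding, expert activation) pair."""
        self.memory.append((list(embedding), plan))

    def plan(self, embedding: Sequence[float]) -> Tuple[Plan, str]:
        """Return a plan for all layers in one pass, plus which stage produced it."""
        best_sim, best_plan = -1.0, None
        for stored, stored_plan in self.memory:
            sim = cosine(embedding, stored)
            if sim > best_sim:
                best_sim, best_plan = sim, stored_plan
        if best_plan is not None and best_sim >= self.threshold:
            return best_plan, "matcher"
        return self.predictor(embedding), "predictor"


def toy_predictor(embedding: Sequence[float]) -> Plan:
    """Placeholder for a trained predictor; always returns a fixed plan."""
    return {0: [0, 1], 1: [2, 3]}


planner = CascadePlanner(toy_predictor, threshold=0.9)
planner.remember([1.0, 0.0], {0: [4], 1: [5]})

near_plan, near_stage = planner.plan([0.99, 0.05])  # close to a stored prompt
far_plan, far_stage = planner.plan([0.0, 1.0])      # unlike anything stored
```

Under these assumptions, raising `threshold` sends more prompts to the predictor and lowering it favors the cheaper lookup, which is one way to picture the efficiency-quality trade-off the abstract attributes to the gating mechanism.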
