RCMoE: A Communication-Efficient Random Compression Framework for Resource-Constrained Mixture-of-Experts Training
DOI:
https://doi.org/10.1609/aaai.v40i32.39899

Abstract
The Mixture-of-Experts (MoE) architecture with expert parallelism scales LLMs efficiently by activating only a subset of experts per input, avoiding proportional training costs. However, its intensive and heterogeneous communication substantially hinders the efficiency and scalability of MoE training in resource-constrained scenarios. Existing communication compression techniques fall short in MoE training because: (i) intensive training amplifies compression overhead, compromising training efficiency; and (ii) accumulated compression errors propagate through the network, degrading training quality. In this paper, we propose RCMoE, a communication-efficient Random Compression framework for MoE training with two core modules: (1) Local-Stochastic Quantization compresses the all-to-all communication by stochastically quantizing each row of an expert's intermediate computing results in parallel, effectively improving compression efficiency and reducing compression error; (2) Probabilistic Thresholding Sparsification compresses the all-reduce communication by sampling large gradients with high probability, thereby reducing computational complexity while maintaining convergence efficiency. Experiments on four typical MoE training tasks show that RCMoE achieves 5.9x-8.1x total communication compression ratios and 1.3x-10.1x training speedups compared with state-of-the-art compression techniques, while maintaining MoE training accuracy.
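The abstract's first module quantizes each row of an expert's intermediate activations independently, using stochastic rounding so the quantizer is unbiased in expectation. The paper's exact scheme is not reproduced here; the following is a minimal numpy sketch of per-row stochastic quantization under that assumption (function names and the 8-bit max-abs scaling are illustrative, not from the paper):

```python
import numpy as np

def stochastic_quantize_rows(x, bits=8, rng=None):
    """Quantize each row of x to signed `bits`-bit integers.

    Each row gets its own max-abs scale, and values are rounded up or
    down at random with probability equal to the fractional part, so
    E[dequantize(q)] == x (unbiased stochastic rounding).
    """
    rng = np.random.default_rng() if rng is None else rng
    levels = 2 ** (bits - 1) - 1                  # e.g. 127 for 8 bits
    scale = np.abs(x).max(axis=1, keepdims=True)  # independent per-row scale
    scale[scale == 0] = 1.0                       # guard all-zero rows
    y = x / scale * levels                        # map rows into [-levels, levels]
    low = np.floor(y)
    q = low + (rng.random(y.shape) < (y - low))   # stochastic rounding step
    return q.astype(np.int8), scale

def dequantize(q, scale, bits=8):
    """Invert the mapping: int8 codes back to floats at per-row scale."""
    levels = 2 ** (bits - 1) - 1
    return q.astype(np.float64) / levels * scale
```

Because rows are scaled and rounded independently, the per-row passes can run in parallel across the expert's output, which matches the abstract's claim of compressing rows "in parallel"; averaging many independent quantizations of the same matrix recovers it, illustrating the unbiasedness.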
Published
2026-03-14
How to Cite
Wu, D., Cai, X., Tan, J., Jia, J., Tan, G., Tao, D., … Tian, Z. (2026). RCMoE: A Communication-Efficient Random Compression Framework for Resource-Constrained Mixture-of-Experts Training. Proceedings of the AAAI Conference on Artificial Intelligence, 40(32), 26876–26884. https://doi.org/10.1609/aaai.v40i32.39899
Section
AAAI Technical Track on Machine Learning IX