FT-MoE: Sustainable-learning Mixture of Experts for Fault-Tolerant Computing

Authors

  • Wenjing Xiao, Guangxi Key Laboratory of Multimedia Communications and Network Technology, Nanning, 530004, China; School of Computer, Electronics and Information, Guangxi University, Nanning, 530004, China
  • Wenhao Song, Guangxi Key Laboratory of Multimedia Communications and Network Technology, Nanning, 530004, China; School of Computer, Electronics and Information, Guangxi University, Nanning, 530004, China
  • Miaojiang Chen, Guangxi Key Laboratory of Multimedia Communications and Network Technology, Nanning, 530004, China; School of Computer, Electronics and Information, Guangxi University, Nanning, 530004, China
  • Min Chen, School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China; Pazhou Laboratory, Guangzhou 510330, China

DOI:

https://doi.org/10.1609/aaai.v40i19.38634

Abstract

Intelligent fault-tolerant (FT) computing has recently demonstrated significant advantages in predicting and diagnosing faults proactively, thereby ensuring reliable service delivery. However, due to the heterogeneity of fault knowledge, dynamic workloads, and limited data support, existing deep learning-based FT algorithms face challenges in fault detection quality and training efficiency. This is primarily because their homogeneous perception of fault knowledge makes it difficult to fully capture diverse and complex fault patterns. To address these challenges, we propose FT-MoE, a sustainable-learning fault-tolerant computing framework based on a dual-path architecture for high-accuracy fault detection and classification. The model employs a mixture-of-experts (MoE) architecture, enabling different parameters to learn distinct fault knowledge. Additionally, we adopt a two-stage learning scheme that combines comprehensive offline training with continual online tuning, allowing the model to adaptively optimize its parameters in response to evolving real-time workloads. To facilitate realistic evaluation, we construct a new fault detection and classification dataset for edge networks, comprising 10,000 intervals with fine-grained resource features, surpassing existing datasets in both scale and granularity. Finally, we conduct extensive experiments on the FT benchmark to verify the effectiveness of FT-MoE. Results demonstrate that our model outperforms state-of-the-art methods.
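
As a rough illustration of the mixture-of-experts idea mentioned in the abstract, the sketch below routes one resource-feature vector to its top-k experts and blends their outputs into class probabilities. It is a generic sparse-gating example, not the FT-MoE architecture: the abstract does not specify the gating function, expert design, or feature set, so the class name `TinyMoE`, the linear experts, the top-k softmax gate, the random untrained weights, and the four sample features are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Apply a weight matrix (list of rows) to a feature vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class TinyMoE:
    """Hypothetical top-k gated mixture of linear experts (untrained)."""

    def __init__(self, n_features, n_experts, n_classes, top_k=2, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        self.gate = init(n_experts, n_features)      # one score per expert
        self.experts = [init(n_classes, n_features)  # one matrix per expert
                        for _ in range(n_experts)]
        self.top_k = top_k

    def route(self, x):
        """Return (expert_index, weight) pairs for the top-k experts."""
        scores = linear(self.gate, x)
        top = sorted(range(len(scores)),
                     key=lambda i: scores[i], reverse=True)[:self.top_k]
        weights = softmax([scores[i] for i in top])
        return list(zip(top, weights))

    def predict_proba(self, x):
        """Blend the selected experts' logits, then normalise to probabilities."""
        n_classes = len(self.experts[0])
        mixed = [0.0] * n_classes
        for idx, weight in self.route(x):
            logits = linear(self.experts[idx], x)
            for c in range(n_classes):
                mixed[c] += weight * logits[c]
        return softmax(mixed)


model = TinyMoE(n_features=4, n_experts=3, n_classes=2, top_k=2)
sample = [0.9, 0.1, 0.4, 0.7]  # assumed CPU, memory, disk, network utilisation
routing = model.route(sample)
probs = model.predict_proba(sample)
print(routing, probs)
```

Because only the selected experts run for a given input, different experts can in principle specialise on different fault patterns; how FT-MoE achieves that specialisation is described in the paper itself, not in this sketch.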

Published

2026-03-14

How to Cite

Xiao, W., Song, W., Chen, M., & Chen, M. (2026). FT-MoE: Sustainable-learning Mixture of Experts for Fault-Tolerant Computing. Proceedings of the AAAI Conference on Artificial Intelligence, 40(19), 16004–16012. https://doi.org/10.1609/aaai.v40i19.38634

Section

AAAI Technical Track on Data Mining & Knowledge Management III