KSS-MoE: Knowledge Space Synergy Framework in Mixture of Experts for Continual Visual Instruction Tuning

Authors

  • Lingyun Song, School of Computer Science, Northwestern Polytechnical University, Xi'an; Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University, Jinhua; Shenzhen Research Institute of Northwestern Polytechnical University, Shenzhen
  • Ziyao Chen, School of Computer Science, Northwestern Polytechnical University, Xi'an
  • Kang Pan, Independent Researcher
  • Xiaolin Han, School of Computer Science, Northwestern Polytechnical University, Xi'an
  • Xinbiao Gan, School of Computer Science, National University of Defense Technology, Changsha
  • Yudai Pan, School of Computer Science, Northwestern Polytechnical University, Xi'an
  • Xiaofan Sun, School of Computer Science, Northwestern Polytechnical University, Xi'an
  • Xiaoqi Wang, School of Computer Science, Northwestern Polytechnical University, Xi'an
  • Xuequn Shang, Shenzhen Research Institute of Northwestern Polytechnical University, Shenzhen; Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University, Ministry of Industry and Information Technology, Xi'an

DOI:

https://doi.org/10.1609/aaai.v40i30.39749

Abstract

Multimodal Large Language Models (MLLMs) built on the Mixture-of-Experts (MoE) structure show encouraging results on vision-language tasks. However, they suffer from catastrophic forgetting, which stems from two causes: ineffective collaboration among experts and negative transfer across tasks. The first arises because the router that MoE typically uses to assign experts is inadequate when the data distribution shifts significantly from task to task. The second arises because conflicts in the knowledge shared between tasks disturb what has already been learned, degrading performance on earlier tasks. To address these issues, we propose the Knowledge Space Synergy Framework in Mixture of Experts (KSS-MoE) for Continual Visual Instruction Tuning (CVIT). It dynamically combines the knowledge subspaces of experts to better integrate fine-grained complementary knowledge and strengthen collaboration among experts, thus addressing the limitations of the basic router. Furthermore, we introduce a general expert that maintains orthogonal subspaces for shared knowledge, enabling effective cross-task knowledge utilization while reducing negative transfer. Extensive experiments on eight CVIT tasks show that KSS-MoE achieves top-tier performance.
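
The two ingredients named in the abstract, a router that weights experts and a constraint that keeps subspaces orthogonal, can be illustrated with a generic sketch. The code below is not the KSS-MoE method, whose details are not given on this page. It assumes a plain softmax router over low-rank experts and a Frobenius-norm orthogonality penalty, and every name in it (moe_lowrank_forward, orthogonality_penalty, router_w, A, B) is hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def moe_lowrank_forward(x, router_w, A, B):
    """Generic softmax-routed mixture of low-rank experts (assumed design).

    x:        (d,)            input feature vector
    router_w: (d, E)          router weights, one column per expert
    A:        (E, d, r)       per-expert down-projection (spans the expert's subspace)
    B:        (E, r, d_out)   per-expert up-projection
    Returns the gate-weighted output (d_out,) and the gates (E,).
    """
    gates = softmax(x @ router_w)                      # (E,), sums to 1
    expert_out = np.einsum("d,edr,ero->eo", x, A, B)   # (E, d_out)
    return gates @ expert_out, gates


def orthogonality_penalty(U, V):
    """Squared Frobenius norm of U^T V.

    It is zero exactly when every column of U is orthogonal to every
    column of V, so minimizing it pushes the two subspaces apart.
    """
    return float(np.sum((U.T @ V) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, r, d_out, E = 4, 2, 5, 3
    x = rng.normal(size=d)
    router_w = rng.normal(size=(d, E))
    A = rng.normal(size=(E, d, r))
    B = rng.normal(size=(E, r, d_out))

    y, gates = moe_lowrank_forward(x, router_w, A, B)
    print("output shape:", y.shape, "gate sum:", gates.sum())

    # Two disjoint sets of standard basis vectors span orthogonal subspaces.
    I = np.eye(d)
    print("penalty (orthogonal):", orthogonality_penalty(I[:, :2], I[:, 2:]))
```

In a continual-learning setting, a penalty of this kind would typically be added to the training loss between a newly learned subspace and those kept from earlier tasks. Whether KSS-MoE does this, and how its general expert is built, should be checked against the paper itself.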

Published

2026-03-14

How to Cite

Song, L., Chen, Z., Pan, K., Han, X., Gan, X., Pan, Y., … Shang, X. (2026). KSS-MoE: Knowledge Space Synergy Framework in Mixture of Experts for Continual Visual Instruction Tuning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(30), 25536–25544. https://doi.org/10.1609/aaai.v40i30.39749

Section

AAAI Technical Track on Machine Learning VII