Capturing Dynamic User Interests Under Modality Imbalance for Multimodal Sequential Recommendation

Authors

  • Zilong Li, Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University
  • Jia Zhu, Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University
  • Chenglei Huang, Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University
  • Zhangze Chen, Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University
  • Hanghui Guo, School of Computer Science and Engineering, Southeast University
  • Guoqing Ma, Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University
  • Jianxia Ling, Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University

DOI:

https://doi.org/10.1609/aaai.v40i18.38544

Abstract

Multimodal sequential recommender systems leverage diverse modal inputs to enhance the accuracy and relevance of personalized recommendations. However, existing fusion strategies often struggle to capture intricate cross-modal interactions, especially as user intent evolves. Moreover, they frequently neglect modality imbalance, leading to suboptimal use of multimodal information. To address these challenges, we propose DuAF-MAT, a novel framework for robust multimodal sequential recommendation. Our approach consists of three key components: (1) a Dual-Aware Adaptive Fusion (DuAF) module that dynamically calibrates modality contributions by jointly modeling user preferences and temporal information, enabling the extraction of multimodal features aligned with evolving user interests; (2) MAT-MoE, which integrates Modality Adversarial Training with the Mixture-of-Experts paradigm and employs an ensemble of expert generators to dynamically reconstruct missing modality representations, mitigating modality imbalance; and (3) a Multi-Supervised Contrastive Learning strategy that integrates cross-modal alignment and virtual sequence augmentation to address the inherent sparsity of sequential behavior data. This strategy enhances user interest modeling by leveraging diverse learning signals, improving model robustness and generalization. Extensive experiments on four public datasets demonstrate that DuAF-MAT significantly outperforms state-of-the-art baselines.
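
As a rough illustration of what "dynamically calibrating modality contributions" could look like, the sketch below computes softmax weights over modalities from a user embedding and a time embedding, then takes a weighted sum of the per-modality item features. This is not the authors' DuAF module, whose details the abstract does not give. The function name gated_fusion, the gating matrix W, and the choice of a single linear gate are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def gated_fusion(modal_feats, user_emb, time_emb, W):
    """Hypothetical context-conditioned modality fusion (illustrative only).

    modal_feats: (M, d) array, one d-dimensional feature per modality
    user_emb:    (d,) user preference embedding (assumed input)
    time_emb:    (d,) temporal embedding (assumed input)
    W:           (M, 2d) gating matrix (assumed learnable parameter)

    Returns the fused (d,) feature and the (M,) modality weights.
    """
    # Condition the gate on both the user and the time signal.
    context = np.concatenate([user_emb, time_emb])  # (2d,)
    # One logit per modality, normalized to weights that sum to 1.
    weights = softmax(W @ context)                  # (M,)
    # Weighted sum of the modality features.
    fused = weights @ modal_feats                   # (d,)
    return fused, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M, d = 3, 4  # e.g. text, image, and ID features (assumed)
    feats = rng.normal(size=(M, d))
    fused, weights = gated_fusion(
        feats, rng.normal(size=d), rng.normal(size=d), rng.normal(size=(M, 2 * d))
    )
    print("modality weights:", weights)
    print("fused feature:", fused)
```

With an all-zero gating matrix the weights are uniform and the fused feature reduces to the plain average of the modality features. A trained gate would instead shift weight toward whichever modality is more informative for a given user and time step.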

Published

2026-03-14

How to Cite

Li, Z., Zhu, J., Huang, C., Chen, Z., Guo, H., Ma, G., & Ling, J. (2026). Capturing Dynamic User Interests Under Modality Imbalance for Multimodal Sequential Recommendation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(18), 15198–15206. https://doi.org/10.1609/aaai.v40i18.38544

Section

AAAI Technical Track on Data Mining & Knowledge Management II