Capturing Dynamic User Interests Under Modality Imbalance for Multimodal Sequential Recommendation

Zilong Li; Jia Zhu; Chenglei Huang; Zhangze Chen; Hanghui Guo; Guoqing Ma; Jianxia Ling

doi:10.1609/aaai.v40i18.38544

Authors

Zilong Li, Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University
Jia Zhu, Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University
Chenglei Huang, Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University
Zhangze Chen, Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University
Hanghui Guo, School of Computer Science and Engineering, Southeast University
Guoqing Ma, Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University
Jianxia Ling, Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University

DOI:

https://doi.org/10.1609/aaai.v40i18.38544

Abstract

Multimodal sequential recommender systems leverage diverse modal inputs to enhance the accuracy and relevance of personalized recommendations. However, existing fusion strategies often struggle to capture intricate cross-modal interactions, especially as user intent evolves over time. They also frequently neglect modality imbalance, which leads to suboptimal use of multimodal information. To address these challenges, we propose DuAF-MAT, a novel framework for robust multimodal sequential recommendation. Our approach consists of three key components: (1) a Dual-Aware Adaptive Fusion (DuAF) module that dynamically calibrates modality contributions by jointly modeling user preferences and temporal information, enabling the extraction of multimodal features aligned with evolving user interests; (2) MAT-MoE, which integrates Modality Adversarial Training with the Mixture-of-Experts paradigm and employs an ensemble of expert generators to dynamically reconstruct missing modality representations, mitigating modality imbalance; and (3) a Multi-Supervised Contrastive Learning strategy that integrates cross-modal alignment and virtual sequence augmentation to address the inherent sparsity of sequential behavior data. By leveraging diverse learning signals, this strategy enhances user interest modeling and improves model robustness and generalization. Extensive experiments on four public datasets demonstrate that DuAF-MAT significantly outperforms state-of-the-art baselines.
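
The idea behind component (1), weighting each modality according to user preference and temporal context, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no equations, so the function name `adaptive_fuse`, the additive context vector, the dot-product scoring, and the toy vectors are all assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def adaptive_fuse(modal_feats, user_vec, time_vec):
    """Fuse per-modality features with context-dependent weights.

    modal_feats: dict mapping a modality name to a feature vector.
    user_vec, time_vec: hypothetical user-preference and temporal
    embeddings with the same dimension as the features.
    Returns (weights per modality, fused feature vector).
    """
    names = sorted(modal_feats)
    # Assumed context: simple sum of user and temporal embeddings.
    context = [u + t for u, t in zip(user_vec, time_vec)]
    # Score each modality by its agreement with the context,
    # then normalize the scores into weights that sum to 1.
    weights = softmax([dot(modal_feats[n], context) for n in names])
    fused = [
        sum(w * modal_feats[n][i] for w, n in zip(weights, names))
        for i in range(len(context))
    ]
    return dict(zip(names, weights)), fused


# Toy example: the context aligns more with the text feature,
# so the text modality receives the larger weight.
feats = {"image": [1.0, 0.0], "text": [0.0, 1.0]}
weights, fused = adaptive_fuse(feats, user_vec=[0.2, 0.9], time_vec=[0.0, 0.5])
```

In this toy setup the weights change whenever `user_vec` or `time_vec` changes, which is the sense in which the fusion is calibrated dynamically. A learned model would presumably replace the dot-product scoring with trainable layers.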
