Subspace-Aware Graph Construction and Contrastive Alignment for Multimodal Recommendation with Large Language Models

Haodong Li; Lianyong Qi; Weiming Liu; Fan Wang; Chong Li; Shengye Pang; Wenwen Gong; Yanwei Xu; Xiaoxiao Chi; Yang Zhang; Xiaokang Zhou

doi:10.1609/aaai.v40i18.38533

Authors

Haodong Li: College of Computer Science and Technology, China University of Petroleum (East China), China; Shandong Key Laboratory of Intelligent Oil and Gas Industrial Software, China
Lianyong Qi: College of Computer Science and Technology, China University of Petroleum (East China), China; Shandong Key Laboratory of Intelligent Oil and Gas Industrial Software, China
Weiming Liu: ByteDance Inc., Singapore
Fan Wang: College of Computer Science and Technology, Zhejiang University, China
Chong Li: College of Computer Science and Technology, China University of Petroleum (East China), China; Shandong Key Laboratory of Intelligent Oil and Gas Industrial Software, China
Shengye Pang: School of Computer Engineering and Science, Shanghai University, China
Wenwen Gong: College of Information and Electrical Engineering, China Agricultural University, China
Yanwei Xu: School of Computer Science, Peking University, China
Xiaoxiao Chi: School of Computing, Macquarie University, Australia
Yang Zhang: Anuradha and Vikas Sinha Department of Data Science, University of North Texas, USA
Xiaokang Zhou: Faculty of Business and Data Science, Kansai University, Japan; RIKEN Center for Advanced Intelligence Project, Japan

Abstract

Multimedia content offers additional context that helps recommender systems better understand user interests. Existing studies on multimodal recommendation primarily focus on constructing item-item semantic graphs. However, most of these methods capture only shallow semantic structures based on feature similarity and struggle to model more complex or cross-entity semantic relationships (e.g., user-item). Moreover, in these methods, collaborative signals often dominate and suppress semantic knowledge, limiting the role that semantic knowledge plays in representation learning. To address these issues, we propose SCALE, a novel framework that combines subspace-aware graph construction and contrastive alignment for multimodal recommendation with large language models. Specifically, we first use large language models and encoders to extract user and item features. Following the subspace clustering assumption, we apply the Orthogonal Matching Pursuit (OMP) algorithm to mine complex semantic structures within the item-item, user-user, and user-item spaces, and integrate them into a unified semantic graph. We then perform graph convolution on both the semantic graph and the interaction graph, and aggregate the resulting representations for recommendation. Furthermore, contrastive losses are employed to strengthen semantic fusion and alignment. Extensive experiments on five real-world datasets demonstrate that SCALE significantly outperforms state-of-the-art multimodal recommendation models, highlighting its effectiveness in modeling complex relationships and integrating semantic knowledge with collaborative signals.
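The graph-construction step relies on the subspace clustering assumption and Orthogonal Matching Pursuit. The following is a minimal sketch of the generic SSC-OMP idea under stated assumptions, not SCALE's exact procedure: each normalized user or item feature vector is approximated by at most `k` other vectors, and the absolute sparse coefficients are symmetrized into an affinity matrix. It assumes users and items share one embedding dimension so they can be stacked into a single dictionary, which yields the user-user, item-item, and user-item blocks in one matrix. The names `omp` and `omp_affinity`, the sparsity `k`, and the toy data are hypothetical.

```python
import numpy as np


def omp(D, y, k, exclude=None, tol=1e-8):
    """Orthogonal Matching Pursuit: approximate y with at most k columns of D.

    D: (d, n) dictionary with unit-norm columns; y: (d,) target vector.
    exclude: optional column index that may never be selected (y itself).
    Returns a length-n coefficient vector with at most k nonzeros.
    """
    n = D.shape[1]
    support = []
    coef = np.zeros(0)
    residual = y.copy()
    for _ in range(k):
        if np.linalg.norm(residual) <= tol:
            break  # y is already reconstructed
        scores = np.abs(D.T @ residual)
        if exclude is not None:
            scores[exclude] = -np.inf
        if support:
            scores[support] = -np.inf  # never pick the same atom twice
        support.append(int(np.argmax(scores)))
        # Re-fit all selected atoms jointly (the "orthogonal" step).
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    c = np.zeros(n)
    if support:
        c[support] = coef
    return c


def omp_affinity(X, k):
    """Self-expressive affinity over the rows of X (one row per user or item)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    D = Xn.T  # every entity serves as a dictionary atom
    n = Xn.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        C[:, i] = omp(D, Xn[i], k, exclude=i)  # express entity i via the others
    return np.abs(C) + np.abs(C).T  # symmetric, zero diagonal


# Hypothetical toy features drawn from two 2-D subspaces of R^4.
users = np.array([[1.0, 0, 0, 0], [0, 0, 1.0, 0]])
items = np.array([[0, 1.0, 0, 0], [1.0, 1, 0, 0], [0, 0, 0, 1.0], [0, 0, 1.0, 1]])
X = np.vstack([users, items])  # users first, then items
n_users = users.shape[0]

W = omp_affinity(X, k=2)
W_uu = W[:n_users, :n_users]   # user-user block
W_ii = W[n_users:, n_users:]   # item-item block
W_ui = W[:n_users, n_users:]   # user-item block
```

On this toy data, each point selects neighbours only from its own subspace, so the user-item block links the first user to the first two items and the second user to the last two. Any sparsification, thresholding, or normalization SCALE applies on top of such an affinity is not described here.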
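The abstract states that graph convolution runs on both the semantic and interaction graphs, that the results are aggregated, and that contrastive losses align them, but it does not give the exact operators. The sketch below therefore assumes common choices: LightGCN-style propagation (symmetric normalization, no feature transforms, mean over layers), summation as the aggregation, and an InfoNCE loss that treats a node's interaction-graph and semantic-graph embeddings as a positive pair. All function names, the temperature `tau`, and the toy matrices are hypothetical.

```python
import numpy as np


def sym_norm(A):
    """D^{-1/2} A D^{-1/2}, leaving isolated nodes as zero rows."""
    deg = A.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    nz = deg > 0
    inv_sqrt[nz] = deg[nz] ** -0.5
    return inv_sqrt[:, None] * A * inv_sqrt[None, :]


def propagate(A, E, layers=2):
    """Parameter-free graph convolution; returns the mean over layers 0..layers."""
    A_hat = sym_norm(A)
    outs = [E]
    for _ in range(layers):
        outs.append(A_hat @ outs[-1])
    return np.mean(outs, axis=0)


def info_nce(Z1, Z2, tau=0.2):
    """Row i of Z1 and row i of Z2 are positives; other rows of Z2 are negatives."""
    Z1 = Z1 / np.linalg.norm(Z1, axis=1, keepdims=True)
    Z2 = Z2 / np.linalg.norm(Z2, axis=1, keepdims=True)
    logits = Z1 @ Z2.T / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))


# Hypothetical toy setup: 2 users, 3 items; nodes ordered users first, then items.
R = np.array([[1.0, 0, 1], [0, 1, 1]])  # observed user-item interactions
n_users, n_items = R.shape
A_int = np.block([[np.zeros((n_users, n_users)), R],
                  [R.T, np.zeros((n_items, n_items))]])  # bipartite interaction graph
W_sem = np.array([[0, 1, 1, 0, 0],
                  [1, 0, 0, 1, 0],
                  [1, 0, 0, 0, 1],
                  [0, 1, 0, 0, 1],
                  [0, 0, 1, 1, 0]], dtype=float)  # stand-in semantic affinity
rng = np.random.default_rng(0)
E = rng.normal(size=(n_users + n_items, 8))  # shared node embeddings

H_int = propagate(A_int, E)            # collaborative view
H_sem = propagate(W_sem, E)            # semantic view
H = H_int + H_sem                      # aggregated representations
loss_cl = info_nce(H_int, H_sem)       # cross-view contrastive alignment
scores = H[:n_users] @ H[n_users:].T   # user-item preference scores
```

In training, `loss_cl` would typically be added with some weight to a ranking loss (for example BPR) computed from `scores`; the weighting and the choice of ranking loss are likewise assumptions rather than details given in the abstract.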