Subspace-Aware Graph Construction and Contrastive Alignment for Multimodal Recommendation with Large Language Models

Authors

  • Haodong Li, College of Computer Science and Technology, China University of Petroleum (East China), China; Shandong Key Laboratory of Intelligent Oil and Gas Industrial Software, China
  • Lianyong Qi, College of Computer Science and Technology, China University of Petroleum (East China), China; Shandong Key Laboratory of Intelligent Oil and Gas Industrial Software, China
  • Weiming Liu, ByteDance Inc., Singapore
  • Fan Wang, College of Computer Science and Technology, Zhejiang University, China
  • Chong Li, College of Computer Science and Technology, China University of Petroleum (East China), China; Shandong Key Laboratory of Intelligent Oil and Gas Industrial Software, China
  • Shengye Pang, School of Computer Engineering and Science, Shanghai University, China
  • Wenwen Gong, College of Information and Electrical Engineering, China Agricultural University, China
  • Yanwei Xu, School of Computer Science, Peking University, China
  • Xiaoxiao Chi, School of Computing, Macquarie University, Australia
  • Yang Zhang, Anuradha and Vikas Sinha Department of Data Science, University of North Texas, USA
  • Xiaokang Zhou, Faculty of Business and Data Science, Kansai University, Japan; RIKEN Center for Advanced Intelligence Project, Japan

DOI:

https://doi.org/10.1609/aaai.v40i18.38533

Abstract

Multimedia content offers additional context that helps recommender systems better understand user interests. Existing studies on multimodal recommendation primarily focus on constructing item-item semantic graphs. However, most of these methods capture only shallow semantic structures based on feature similarity and struggle to model more complex or cross-entity semantic relationships (e.g., user-item). Moreover, in these methods, collaborative signals often dominate and suppress semantic knowledge, limiting the role that semantic knowledge plays in representation learning. To address these issues, we propose SCALE, a novel framework that combines subspace-aware graph construction and contrastive alignment for multimodal recommendation with large language models. Specifically, we first use large language models and encoders to extract user and item features. Following the subspace clustering assumption, we apply the Orthogonal Matching Pursuit algorithm to mine complex semantic structures within the item-item, user-user, and user-item spaces, and integrate them into a unified semantic graph. We then perform graph convolution on both the semantic and interaction graphs, and aggregate the results for recommendation. Furthermore, contrastive losses are employed to enhance semantic fusion and alignment. Extensive experiments on five real-world datasets demonstrate that SCALE significantly outperforms state-of-the-art multimodal recommendation models, highlighting its effectiveness in modeling complex relationships and integrating semantic knowledge with collaborative signals.
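
The subspace-aware graph construction mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it only shows the generic idea of sparse self-representation with Orthogonal Matching Pursuit, where each feature vector is approximated by a few other vectors and the resulting coefficients become edge weights. The function names (`omp_neighbors`, `build_affinity`), the sparsity parameter `k`, the tolerance `tol`, and the toy features are all assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n > 0 else list(v)

def solve(A, b):
    """Solve a small square system A c = b by Gauss-Jordan elimination
    with partial pivoting. Assumes A is non-singular."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def omp_neighbors(features, i, k, tol=1e-8):
    """Greedy OMP: approximate features[i] using at most k other rows.
    Returns {neighbor_index: coefficient}. All names are hypothetical."""
    x = features[i]
    residual = list(x)
    support, coeffs = [], []
    for _ in range(k):
        if dot(residual, residual) <= tol:
            break  # already reconstructed well enough
        candidates = [j for j in range(len(features))
                      if j != i and j not in support]
        if not candidates:
            break
        # Pick the atom most correlated with the current residual.
        best = max(candidates, key=lambda j: abs(dot(features[j], residual)))
        if abs(dot(features[best], residual)) <= tol:
            break  # nothing left that can reduce the residual
        support.append(best)
        # Re-fit all selected coefficients by least squares (normal equations).
        gram = [[dot(features[a], features[b]) for b in support] for a in support]
        rhs = [dot(features[a], x) for a in support]
        coeffs = solve(gram, rhs)
        residual = [xv - sum(c * features[s][d] for c, s in zip(coeffs, support))
                    for d, xv in enumerate(x)]
    return dict(zip(support, coeffs))

def build_affinity(features, k):
    """Symmetric affinity matrix from the absolute OMP coefficients."""
    feats = [normalize(f) for f in features]
    n = len(feats)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j, c in omp_neighbors(feats, i, k).items():
            W[i][j] += abs(c)
            W[j][i] += abs(c)
    return W

# Toy data (assumed): three points in the xy-plane, two points on the z-axis.
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0],
            [0.0, 0.0, 1.0], [0.0, 0.0, 2.0]]
W = build_affinity(features, k=2)
```

In this toy example the points come from two independent subspaces, so the nonzero entries of `W` connect only points within the same subspace. A real pipeline would presumably run this over learned user and item embeddings rather than hand-written vectors.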

Published

2026-03-14

How to Cite

Li, H., Qi, L., Liu, W., Wang, F., Li, C., Pang, S., … Zhou, X. (2026). Subspace-Aware Graph Construction and Contrastive Alignment for Multimodal Recommendation with Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(18), 15099–15107. https://doi.org/10.1609/aaai.v40i18.38533

Section

AAAI Technical Track on Data Mining & Knowledge Management II