Structures Meet Semantics: Multimodal Fusion via Graph Contrastive Learning

Authors

  • Jiangfeng Sun, Beijing University of Posts and Telecommunications
  • SiHao He, Beijing University of Posts and Telecommunications
  • Zhonghong Ou, Beijing University of Posts and Telecommunications
  • Meina Song, Beijing University of Posts and Telecommunications; China University of Petroleum

DOI:

https://doi.org/10.1609/aaai.v40i30.39766

Abstract

Multimodal sentiment analysis (MSA) aims to infer emotional states by effectively integrating textual, acoustic, and visual modalities. Despite notable progress, existing multimodal fusion methods often neglect modality-specific structural dependencies and semantic misalignment, limiting their quality, interpretability, and robustness. To address these challenges, we propose a novel framework called the Structural-Semantic Unifier (SSU), which systematically integrates modality-specific structural information and cross-modal semantic grounding for enhanced multimodal representations. Specifically, SSU dynamically constructs modality-specific graphs by leveraging linguistic syntax for text and a lightweight, text-guided attention mechanism for the acoustic and visual modalities, thus capturing detailed intra-modal relationships and semantic interactions. We further introduce a semantic anchor, derived from global textual semantics, that serves as a cross-modal alignment hub, effectively harmonizing the heterogeneous semantic spaces across modalities. Additionally, we develop a multi-view contrastive learning objective that promotes discriminability, semantic consistency, and structural coherence across intra- and inter-modal views. Extensive evaluations on two widely used benchmark datasets, CMU-MOSI and CMU-MOSEI, demonstrate that SSU consistently achieves state-of-the-art performance while significantly reducing computational overhead compared to prior methods. Comprehensive qualitative analyses further validate SSU's interpretability and its ability to capture nuanced emotional patterns through semantically grounded interactions.
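The abstract does not specify the exact form of the multi-view contrastive objective; a common instantiation of cross-view contrastive learning is an InfoNCE-style loss, where paired embeddings from two views (e.g., a modality-specific graph representation and the text-derived semantic anchor) act as positives and all other pairs in the batch act as negatives. The sketch below is purely illustrative and is not the paper's implementation; the function name, temperature value, and pairing scheme are assumptions.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.07):
    """Illustrative InfoNCE loss between two views.

    z_a, z_b: (N, D) arrays; row i of z_a is paired with row i of z_b
    (e.g., one modality's graph embedding and the semantic anchor).
    This is a generic sketch, not the SSU objective itself.
    """
    # L2-normalize so dot products are cosine similarities
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature            # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal: sample i in view A vs. sample i in view B
    return -np.mean(np.diag(log_prob))

# A multi-view objective can then average pairwise losses over views,
# e.g., (text graph, acoustic graph, visual graph, semantic anchor).
```

In such a setup, perfectly aligned views drive the loss toward zero, while unrelated views yield a loss near log(N) for a batch of N samples.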

Published

2026-03-14

How to Cite

Sun, J., He, S., Ou, Z., & Song, M. (2026). Structures Meet Semantics: Multimodal Fusion via Graph Contrastive Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(30), 25691–25699. https://doi.org/10.1609/aaai.v40i30.39766

Section

AAAI Technical Track on Machine Learning VII