Structures Meet Semantics: Multimodal Fusion via Graph Contrastive Learning
DOI:
https://doi.org/10.1609/aaai.v40i30.39766
Abstract
Multimodal sentiment analysis (MSA) aims to infer emotional states by effectively integrating textual, acoustic, and visual modalities. Despite notable progress, existing multimodal fusion methods often neglect modality-specific structural dependencies and semantic misalignment, limiting their quality, interpretability, and robustness. To address these challenges, we propose a novel framework called the Structural-Semantic Unifier (SSU), which systematically integrates modality-specific structural information and cross-modal semantic grounding for enhanced multimodal representations. Specifically, SSU dynamically constructs modality-specific graphs by leveraging linguistic syntax for text and a lightweight, text-guided attention mechanism for acoustic and visual modalities, thus capturing detailed intra-modal relationships and semantic interactions. We further introduce a semantic anchor, derived from global textual semantics, that serves as a cross-modal alignment hub, effectively harmonizing heterogeneous semantic spaces across modalities. Additionally, we develop a multi-view contrastive learning objective that promotes discriminability, semantic consistency, and structural coherence across intra- and inter-modal views. Extensive evaluations on two widely used benchmark datasets, CMU-MOSI and CMU-MOSEI, demonstrate that SSU consistently achieves state-of-the-art performance while significantly reducing computational overhead compared to prior methods. Comprehensive qualitative analyses further validate SSU’s interpretability and its ability to capture nuanced emotional patterns through semantically grounded interactions.
Published
2026-03-14
How to Cite
Sun, J., He, S., Ou, Z., & Song, M. (2026). Structures Meet Semantics: Multimodal Fusion via Graph Contrastive Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(30), 25691–25699. https://doi.org/10.1609/aaai.v40i30.39766
Issue
Section
AAAI Technical Track on Machine Learning VII