MDF: A Modality-Aware Disentanglement and Fusion Framework for Multimodal Sentiment Analysis

Authors

  • Zhongquan Jian, School of Computer and Data Science, Minjiang University, Fuzhou, China
  • Wenhan Lv, School of Film, Xiamen University, Xiamen, China
  • Yanhao Chen, School of Film, Xiamen University, Xiamen, China
  • Guanran Luo, School of Informatics, Xiamen University, Xiamen, China
  • Wentao Qiu, School of Informatics, Xiamen University, Xiamen, China
  • Shaopan Wang, School of Informatics, Xiamen University, Xiamen, China
  • Bingbing Hu, School of Film, Xiamen University, Xiamen, China
  • Qingqiang Wu, School of Film and School of Informatics, Xiamen University, Xiamen, China

DOI:

https://doi.org/10.1609/aaai.v40i37.40392

Abstract

The homogeneity and heterogeneity across modalities are critical factors that influence multimodal fusion. In Multimodal Sentiment Analysis (MSA), the textual information inherent in the audio modality induces cross-modality homogeneity with the text modality. Conversely, the mutual independence between the text and vision modalities results in their cross-modality heterogeneity. Although existing disentanglement-based methods achieve notable performance gains by separating modality features into distinct subspaces, they overlook these heterogeneous and homogeneous relationships among modalities. To this end, we propose a novel Modality-Aware Disentanglement and Fusion (MDF) framework to investigate the role of core modality features. Specifically, we first use text as the anchor to disentangle the audio modality and extract its unique modality-specific features, thereby establishing cross-modality heterogeneity among text, audio, and vision. We then introduce a Cross-Modality Heterogeneity Enhancement (CHE) module to refine these features, further reinforcing their heterogeneous characteristics. Finally, a Modality Adaptive Weighting (MAW) module dynamically assigns weights to the text, audio, and vision modalities based on their potential contributions to sentiment prediction, yielding a more effective multimodal representation for MSA. Experimental evaluations on different benchmarks demonstrate MDF's superiority, with extensive ablation studies confirming its effectiveness.
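The idea behind adaptive modality weighting can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the function name, the scalar per-modality scores, and the softmax gating are generic choices, not the paper's actual MAW implementation.

```python
import math

def modality_adaptive_weighting(text, audio, vision, scores):
    """Fuse three per-modality feature vectors with softmax-normalized weights.

    `scores` are hypothetical scalar contribution scores (in a real model
    they would come from a learned gating network); here they are inputs.
    """
    # Softmax over the three modality scores -> weights summing to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the modality feature vectors (same dimensionality).
    fused = [
        weights[0] * t + weights[1] * a + weights[2] * v
        for t, a, v in zip(text, audio, vision)
    ]
    return fused, weights

# A higher score gives the text modality a larger share of the fused vector.
fused, w = modality_adaptive_weighting(
    text=[1.0, 0.0], audio=[0.0, 1.0], vision=[0.5, 0.5],
    scores=[2.0, 1.0, 1.0],
)
```

The softmax keeps the weights positive and summing to one, so the fused vector stays in the convex hull of the modality features; any differentiable normalization with those properties would serve the same illustrative purpose.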

Published

2026-03-14

How to Cite

Jian, Z., Lv, W., Chen, Y., Luo, G., Qiu, W., Wang, S., Hu, B., & Wu, Q. (2026). MDF: A Modality-Aware Disentanglement and Fusion Framework for Multimodal Sentiment Analysis. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37), 31292-31300. https://doi.org/10.1609/aaai.v40i37.40392

Section

AAAI Technical Track on Natural Language Processing II