MDF: A Modality-Aware Disentanglement and Fusion Framework for Multimodal Sentiment Analysis

Authors

  • Zhongquan Jian, School of Computer and Data Science, Minjiang University, Fuzhou, China
  • Wenhan Lv, School of Film, Xiamen University, Xiamen, China
  • Yanhao Chen, School of Film, Xiamen University, Xiamen, China
  • Guanran Luo, School of Informatics, Xiamen University, Xiamen, China
  • Wentao Qiu, School of Informatics, Xiamen University, Xiamen, China
  • Shaopan Wang, School of Informatics, Xiamen University, Xiamen, China
  • Bingbing Hu, School of Film, Xiamen University, Xiamen, China
  • Qingqiang Wu, School of Film and School of Informatics, Xiamen University, Xiamen, China

DOI:

https://doi.org/10.1609/aaai.v40i37.40392

Abstract

The homogeneity and heterogeneity across modalities are critical factors that influence multimodal fusion. In Multimodal Sentiment Analysis (MSA), the textual information inherent in the audio modality induces cross-modality homogeneity with the text modality. Conversely, the mutual independence between the text and vision modalities results in their cross-modality heterogeneity. Although existing disentanglement-based methods achieve notable performance gains by separating modality features into distinct subspaces, they overlook these heterogeneous and homogeneous relationships among modalities. To this end, we propose a novel Modality-Aware Disentanglement and Fusion (MDF) framework to investigate the role of core modality features. Specifically, we first use text as the anchor to disentangle the audio modality and extract its unique modality-specific features, thereby establishing cross-modality heterogeneity among text, audio, and vision. We then introduce a Cross-Modality Heterogeneity Enhancement (CHE) module to refine these features, further reinforcing their heterogeneous characteristics. Finally, a Modality Adaptive Weighting (MAW) module dynamically assigns weights to the text, audio, and vision modalities based on their potential contributions to sentiment prediction, yielding a more effective multimodal representation for MSA. Experimental evaluations on different benchmarks demonstrate MDF's superiority, with extensive ablation studies confirming its effectiveness.
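The idea behind adaptive modality weighting can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the function name, the scalar per-modality scores, and the softmax gating are generic choices, not the paper's actual MAW implementation.

```python
import math

def modality_adaptive_weighting(text, audio, vision, scores):
    """Fuse three per-modality feature vectors with softmax-normalized weights.

    `scores` are hypothetical scalar contribution scores (in a real model
    they would come from a learned gating network); here they are inputs.
    """
    # Softmax over the three modality scores -> weights summing to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the modality feature vectors (same dimensionality).
    fused = [
        weights[0] * t + weights[1] * a + weights[2] * v
        for t, a, v in zip(text, audio, vision)
    ]
    return fused, weights

# A higher score gives the text modality a larger share of the fused vector.
fused, w = modality_adaptive_weighting(
    text=[1.0, 0.0], audio=[0.0, 1.0], vision=[0.5, 0.5],
    scores=[2.0, 1.0, 1.0],
)
```

The softmax keeps the weights positive and summing to one, so the fused vector stays in the convex hull of the modality features; any differentiable normalization with those properties would serve the same illustrative purpose.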

Published

2026-03-14

How to Cite

Jian, Z., Lv, W., Chen, Y., Luo, G., Qiu, W., Wang, S., Hu, B., & Wu, Q. (2026). MDF: A Modality-Aware Disentanglement and Fusion Framework for Multimodal Sentiment Analysis. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37), 31292-31300. https://doi.org/10.1609/aaai.v40i37.40392

Section

AAAI Technical Track on Natural Language Processing II