Multi-to-Single: Reducing Multimodal Dependency in Emotion Recognition Through Contrastive Learning
DOI:
https://doi.org/10.1609/aaai.v39i2.32134
Abstract
Multimodal emotion recognition is a crucial research area in the field of affective brain-computer interfaces. In practical applications, however, it is often challenging to obtain all modalities simultaneously. To deal with this problem, researchers have focused on cross-modal methods that learn multimodal representations from fewer modalities. However, because the distributions of different modalities differ significantly, it is difficult for any single modality to fully learn multimodal features. To address this limitation, we propose a Multi-to-Single (M2S) emotion recognition model that leverages contrastive learning and incorporates two innovative modules: 1) a spatial and temporal-sparse (STS) attention mechanism that enhances the encoders' ability to extract features from the data; and 2) a novel Multi-to-Multi Contrastive Predictive Coding (M2M CPC) module that learns and fuses features across different modalities. At test time, we use only a single modality for emotion recognition, reducing the dependence on multimodal data. Extensive experiments on five public multimodal emotion datasets demonstrate that our model achieves state-of-the-art performance on cross-modal tasks and maintains multimodal-level performance using only a single modality.
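The abstract's core idea of aligning modality embeddings with contrastive learning can be illustrated with a generic InfoNCE-style loss. This is a minimal sketch of standard contrastive predictive coding, not the paper's exact M2M CPC objective; the function name, temperature value, and the use of paired per-sample embeddings are illustrative assumptions.

```python
import numpy as np

def info_nce(anchor, positive, temperature=0.1):
    """Generic InfoNCE contrastive loss between two modality embeddings.

    anchor, positive: (batch, dim) arrays; row i of each is a matched pair
    (e.g. EEG and eye-movement embeddings of the same trial).
    NOTE: a sketch of standard CPC-style alignment, not the exact
    M2M CPC module described in the paper.
    """
    # L2-normalize so the dot product is cosine similarity
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Matched pairs sit on the diagonal; the loss pulls them together
    # and pushes apart the in-batch negatives on the off-diagonal.
    return -np.mean(np.diag(log_probs))
```

Minimizing this loss draws each modality's embedding of a sample toward its counterpart in the other modality, which is what lets a single modality stand in for the full multimodal representation at test time.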
Published
2025-04-11
How to Cite
Liu, Y.-K., Cai, J., Lu, B.-L., & Zheng, W.-L. (2025). Multi-to-Single: Reducing Multimodal Dependency in Emotion Recognition Through Contrastive Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 39(2), 1438–1446. https://doi.org/10.1609/aaai.v39i2.32134
Section
AAAI Technical Track on Cognitive Modeling & Cognitive Systems