Multi-to-Single: Reducing Multimodal Dependency in Emotion Recognition Through Contrastive Learning

Authors

  • Yan-Kai Liu, Shanghai Jiao Tong University
  • Jinyu Cai, Shanghai Jiao Tong University
  • Bao-Liang Lu, Shanghai Jiao Tong University
  • Wei-Long Zheng, Shanghai Jiao Tong University

DOI:

https://doi.org/10.1609/aaai.v39i2.32134

Abstract

Multimodal emotion recognition is a crucial research area in the field of affective brain-computer interfaces. However, in practical applications, it is often challenging to obtain all modalities simultaneously. To deal with this problem, researchers have focused on cross-modal methods that learn multimodal representations from fewer modalities. However, due to the significant differences in the distributions of different modalities, it is challenging for any single modality to fully learn multimodal features. To address this limitation, we propose a Multi-to-Single (M2S) emotion recognition model that leverages contrastive learning and incorporates two innovative modules: 1) a spatial and temporal-sparse (STS) attention mechanism that enhances the encoders' ability to extract features from data; 2) a novel Multi-to-Multi Contrastive Predictive Coding (M2M CPC) that learns and fuses features across different modalities. At test time, we use only a single modality for emotion recognition, reducing the dependence on multimodal data. Extensive experiments on five public multimodal emotion datasets demonstrate that our model achieves state-of-the-art performance on cross-modal tasks and maintains multimodal-level performance using only a single modality.
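The abstract's M2M CPC module builds on the contrastive-predictive-coding family of objectives, which align paired embeddings from different modalities via an InfoNCE-style loss. As a rough, hedged illustration of that objective family (not the paper's actual implementation; the function name, shapes, and temperature are illustrative assumptions), a minimal NumPy sketch might look like:

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Illustrative InfoNCE-style contrastive loss between two modalities.

    z_a, z_b: (N, D) embedding matrices; row i of z_a and row i of z_b
    form a positive pair, and all other rows in the batch act as negatives.
    (Shapes and temperature are assumptions for illustration, not taken
    from the M2S paper.)
    """
    # L2-normalize so the dot product becomes cosine similarity
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature           # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives sit on the diagonal; minimize their negative log-likelihood
    return -np.mean(np.diag(log_prob))
```

Intuitively, the loss is low when each sample's two modality embeddings are more similar to each other than to any other sample's embeddings in the batch, which is how such objectives pull cross-modal representations of the same emotion instance together.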

Published

2025-04-11

How to Cite

Liu, Y.-K., Cai, J., Lu, B.-L., & Zheng, W.-L. (2025). Multi-to-Single: Reducing Multimodal Dependency in Emotion Recognition Through Contrastive Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 39(2), 1438–1446. https://doi.org/10.1609/aaai.v39i2.32134

Section

AAAI Technical Track on Cognitive Modeling & Cognitive Systems