M3ER: Multiplicative Multimodal Emotion Recognition using Facial, Textual, and Speech Cues

Authors

  • Trisha Mittal University of Maryland, College Park
  • Uttaran Bhattacharya University of Maryland, College Park
  • Rohan Chandra University of Maryland, College Park
  • Aniket Bera University of Maryland, College Park
  • Dinesh Manocha University of Maryland, College Park

DOI:

https://doi.org/10.1609/aaai.v34i02.5492

Abstract

We present M3ER, a learning-based method for emotion recognition from multiple input modalities. Our approach combines cues from multiple co-occurring modalities (such as face, text, and speech) and is more robust to sensor noise in any individual modality than other methods. M3ER uses a novel, data-driven multiplicative fusion method to combine the modalities, which learns to emphasize the more reliable cues and suppress the others on a per-sample basis. M3ER is robust to sensor noise because it introduces a check step that uses Canonical Correlation Analysis to distinguish ineffective modalities from effective ones, and it generates proxy features in place of the ineffective modalities. We demonstrate the effectiveness of our network through experiments on two benchmark datasets, IEMOCAP and CMU-MOSEI. We report mean accuracies of 82.7% on IEMOCAP and 89.0% on CMU-MOSEI, which, collectively, is an improvement of about 5% over prior work.
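The abstract's central idea is a multiplicative fusion that scales each modality's contribution per sample. Below is a minimal, illustrative sketch of a multiplicative-combination loss of that kind, assuming three unimodal classifiers; the constant `BETA`, the function name `fuse_loss`, and the exact exponent scheme are assumptions for illustration and are not taken from the paper.

```python
# Illustrative sketch: a multiplicative-combination loss in which each
# modality's cross-entropy term is scaled down when the *other* modalities
# already predict the true class confidently, so the network learns to rely
# on the more reliable cues on a per-sample basis.
import numpy as np

BETA = 2.0  # hypothetical down-weighting strength


def fuse_loss(probs_true_class: np.ndarray) -> float:
    """probs_true_class: shape (M,), each modality's softmax probability
    assigned to the ground-truth emotion class for a single sample."""
    m = len(probs_true_class)
    loss = 0.0
    for i in range(m):
        others = np.delete(probs_true_class, i)
        # If the other modalities are confident, this modality's
        # cross-entropy term is suppressed multiplicatively.
        weight = np.prod((1.0 - others) ** (BETA / (m - 1)))
        loss += -weight * np.log(probs_true_class[i] + 1e-12)
    return loss


# Example: speech is noisy (low probability on the true class) while face and
# text are confident, so the speech term contributes little to the total loss.
print(fuse_loss(np.array([0.9, 0.85, 0.2])))
```

This is only a sketch of the general multiplicative-fusion idea; the paper's fusion, its CCA-based check step, and the proxy-feature generation are specified in the full text.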

Published

2020-04-03

How to Cite

Mittal, T., Bhattacharya, U., Chandra, R., Bera, A., & Manocha, D. (2020). M3ER: Multiplicative Multimodal Emotion Recognition using Facial, Textual, and Speech Cues. Proceedings of the AAAI Conference on Artificial Intelligence, 34(02), 1359-1367. https://doi.org/10.1609/aaai.v34i02.5492

Section

AAAI Technical Track: Cognitive Systems