An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos

Authors

  • Sicheng Zhao (University of California, Berkeley)
  • Yunsheng Ma (Didi Chuxing & Harbin Institute of Technology, Weihai)
  • Yang Gu (Didi Chuxing)
  • Jufeng Yang (Nankai University)
  • Tengfei Xing (Didi Chuxing)
  • Pengfei Xu (Didi Chuxing)
  • Runbo Hu (Didi Chuxing)
  • Hua Chai (Didi Chuxing)
  • Kurt Keutzer (University of California, Berkeley)

DOI:

https://doi.org/10.1609/aaai.v34i01.5364

Abstract

Emotion recognition in user-generated videos plays an important role in human-centered computing. Existing methods mainly employ a traditional two-stage shallow pipeline, i.e., extracting visual and/or audio features and then training classifiers. In this paper, we propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs). Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attention into a visual 3D CNN and temporal attention into an audio 2D CNN. Further, we design a special classification loss, i.e., a polarity-consistent cross-entropy loss, based on the polarity-emotion hierarchy constraint to guide the attention generation. Extensive experiments conducted on the challenging VideoEmotion-8 and Ekman-6 datasets demonstrate that the proposed VAANet outperforms the state-of-the-art approaches for video emotion recognition. Our source code is released at: https://github.com/maysonma/VAANet.
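
The abstract does not spell out the polarity-consistent cross-entropy loss, so the following is only a minimal sketch of one plausible reading: ordinary cross-entropy, scaled up when the predicted emotion falls in a different polarity (positive or negative) than the ground-truth emotion. The function name `polarity_consistent_ce`, the `penalty` weight, and the emotion-to-polarity table are all assumptions made for illustration and are not taken from the released VAANet code.

```python
import math

# Illustrative emotion-to-polarity table (an assumption, not from the paper).
POLARITY = {
    "anger": "neg", "disgust": "neg", "fear": "neg", "sadness": "neg",
    "anticipation": "pos", "joy": "pos", "surprise": "pos", "trust": "pos",
}
CLASSES = sorted(POLARITY)  # fixed class order used to index the logits


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def polarity_consistent_ce(logits, target, penalty=0.5):
    """Cross-entropy, multiplied by (1 + penalty) on a polarity mismatch.

    `penalty` is a hypothetical hyperparameter; penalty=0 gives plain
    cross-entropy.
    """
    probs = softmax(logits)
    predicted = CLASSES[probs.index(max(probs))]
    ce = -math.log(probs[CLASSES.index(target)])
    if POLARITY[predicted] != POLARITY[target]:
        return ce * (1.0 + penalty)
    return ce


# Example: the model favours "joy"; the true label is either "trust"
# (same polarity) or "sadness" (opposite polarity).
logits = [0.0, 0.0, 0.0, 0.0, 2.0, 1.0, 0.0, 0.0]
loss_same = polarity_consistent_ce(logits, "trust")
loss_opposite = polarity_consistent_ce(logits, "sadness")
```

Under this reading, a mistake that stays within the correct polarity costs the plain cross-entropy, while a mistake that crosses polarity is penalized more heavily, which is the kind of hierarchy constraint the abstract refers to.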

Published

2020-04-03

How to Cite

Zhao, S., Ma, Y., Gu, Y., Yang, J., Xing, T., Xu, P., Hu, R., Chai, H., & Keutzer, K. (2020). An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos. Proceedings of the AAAI Conference on Artificial Intelligence, 34(01), 303-311. https://doi.org/10.1609/aaai.v34i01.5364

Section

AAAI Technical Track: AI and the Web