TY - JOUR
AU - Zhao, Sicheng
AU - Ma, Yunsheng
AU - Gu, Yang
AU - Yang, Jufeng
AU - Xing, Tengfei
AU - Xu, Pengfei
AU - Hu, Runbo
AU - Chai, Hua
AU - Keutzer, Kurt
PY - 2020/04/03
Y2 - 2024/03/28
TI - An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos
JF - Proceedings of the AAAI Conference on Artificial Intelligence
JA - AAAI
VL - 34
IS - 01
SE - AAAI Technical Track: AI and the Web
DO - 10.1609/aaai.v34i01.5364
UR - https://ojs.aaai.org/index.php/AAAI/article/view/5364
SP - 303-311
AB - Emotion recognition in user-generated videos plays an important role in human-centered computing. Existing methods mainly employ a traditional two-stage shallow pipeline, i.e., extracting visual and/or audio features and training classifiers. In this paper, we propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs). Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN. Further, we design a special classification loss, i.e., polarity-consistent cross-entropy loss, based on the polarity-emotion hierarchy constraint to guide the attention generation. Extensive experiments conducted on the challenging VideoEmotion-8 and Ekman-6 datasets demonstrate that the proposed VAANet outperforms the state-of-the-art approaches for video emotion recognition. Our source code is released at: https://github.com/maysonma/VAANet.
ER -