SalSAC: A Video Saliency Prediction Model with Shuffled Attentions and Correlation-Based ConvLSTM

Authors

  • Xinyi Wu, University of South Carolina
  • Zhenyao Wu, University of South Carolina
  • Jinglin Zhang, Nanjing University of Information Science and Technology
  • Lili Ju, University of South Carolina
  • Song Wang, University of South Carolina

DOI:

https://doi.org/10.1609/aaai.v34i07.6927

Abstract

The performance of predicting human fixations in videos has been greatly enhanced by the development of convolutional neural networks (CNNs). In this paper, we propose a novel end-to-end neural network “SalSAC” for video saliency prediction, which uses CNN-LSTM-Attention as its basic architecture and utilizes information from both static and dynamic aspects. To better represent the static information of each frame, we first extract multi-level features of the same size from different layers of the encoder CNN and compute the corresponding multi-level attentions; we then randomly shuffle these attention maps among the levels and multiply them with the extracted multi-level features, respectively. In this way, we leverage the attention consistency across different layers to improve the robustness of the network. On the dynamic aspect, we propose a correlation-based ConvLSTM to appropriately balance the influence of the current and preceding frames on the prediction. Experimental results on the DHF1K, Hollywood2 and UCF-sports datasets show that SalSAC outperforms many existing state-of-the-art methods.
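To make the two mechanisms concrete, below is a minimal PyTorch sketch of the shuffled-attention step: one attention head per feature level, with the resulting maps randomly permuted across levels during training before reweighting the features. The module names, the 1x1-conv-plus-sigmoid attention form, and the feature shapes are illustrative assumptions, not the authors' exact implementation.

    # Hypothetical sketch of shuffled multi-level attention; not the paper's code.
    import random
    import torch
    import torch.nn as nn

    class ShuffledAttention(nn.Module):
        """Compute one attention map per feature level, randomly permute the
        maps among levels at training time, and reweight each feature map
        with the (possibly swapped) attention."""

        def __init__(self, channels: int, num_levels: int):
            super().__init__()
            # One 1x1-conv attention head per level (assumed design).
            self.heads = nn.ModuleList(
                [nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())
                 for _ in range(num_levels)]
            )

        def forward(self, feats):  # feats: list of (B, C, H, W), same size
            atts = [head(f) for head, f in zip(self.heads, feats)]
            if self.training:
                random.shuffle(atts)  # shuffle attention maps among levels
            # Multiply each (shuffled) attention map with its feature level.
            return [f * a for f, a in zip(feats, atts)]

Similarly, the correlation-based ConvLSTM can be sketched as a standard ConvLSTM cell whose state update is blended by a correlation score between the current frame's features and the previous hidden state. The cosine-similarity gating below is an assumption made for illustration; see the paper for the authors' exact formulation.

    # Hypothetical correlation-gated ConvLSTM cell; an illustrative sketch only.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CorrConvLSTMCell(nn.Module):
        def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
            super().__init__()
            self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)
            self.proj = nn.Conv2d(in_ch, hid_ch, 1)  # match channels for correlation

        def forward(self, x, state):
            h, c = state
            # Global correlation between current features and previous hidden
            # state, squashed to (0, 1); balances current vs. preceding frames.
            w = torch.sigmoid(
                F.cosine_similarity(self.proj(x).flatten(1), h.flatten(1), dim=1)
            ).view(-1, 1, 1, 1)
            i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
            i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
            # Correlation-weighted blend: high correlation trusts the new input more.
            c = (1 - w) * f * c + w * i * g
            h = o * torch.tanh(c)
            return h, (h, c)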

Published

2020-04-03

How to Cite

Wu, X., Wu, Z., Zhang, J., Ju, L., & Wang, S. (2020). SalSAC: A Video Saliency Prediction Model with Shuffled Attentions and Correlation-Based ConvLSTM. Proceedings of the AAAI Conference on Artificial Intelligence, 34(07), 12410-12417. https://doi.org/10.1609/aaai.v34i07.6927

Section

AAAI Technical Track: Vision