SalSAC: A Video Saliency Prediction Model with Shuffled Attentions and Correlation-Based ConvLSTM

Xinyi Wu; Zhenyao Wu; Jinglin Zhang; Lili Ju; Song Wang

doi:10.1609/aaai.v34i07.6927

Authors

Xinyi Wu, University of South Carolina
Zhenyao Wu, University of South Carolina
Jinglin Zhang, Nanjing University of Information Science and Technology
Lili Ju, University of South Carolina
Song Wang, University of South Carolina

DOI:

https://doi.org/10.1609/aaai.v34i07.6927

Abstract

The prediction of human fixations in videos has improved considerably with the development of convolutional neural networks (CNNs). In this paper, we propose a novel end-to-end neural network, “SalSAC”, for video saliency prediction, which uses a CNN-LSTM-Attention design as its basic architecture and exploits both static and dynamic information. To better represent the static information of each frame, we first extract multi-level features of the same size from different layers of the encoder CNN and compute the corresponding multi-level attention maps; we then randomly shuffle these attention maps among levels and multiply them with the extracted multi-level features, one map per level. In this way, we leverage attention consistency across layers to improve the robustness of the network. For the dynamic aspect, we propose a correlation-based ConvLSTM that balances the influence of the current and preceding frames on the prediction. Experimental results on the DHF1K, Hollywood2, and UCF-sports datasets show that SalSAC outperforms many existing state-of-the-art methods.
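
The shuffled-attention step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `shuffled_attention`, the sigmoid-of-channel-mean attention, and the tensor shapes are assumptions chosen for illustration. The abstract only specifies that per-level attention maps are randomly shuffled among levels and multiplied with same-size multi-level features.

```python
import numpy as np


def shuffled_attention(features, rng):
    """Apply randomly shuffled per-level attention maps to multi-level features.

    features: list of arrays shaped (C, H, W), one per encoder level,
              all with the same spatial size (as the abstract requires).
    rng:      a numpy.random.Generator used to draw the shuffle.
    Returns the attended features and the permutation that was used.
    """
    # ASSUMPTION: a stand-in attention module. Each level gets one spatial
    # map of shape (1, H, W) from a sigmoid over the channel mean. The real
    # model presumably learns its attention maps.
    attentions = [
        1.0 / (1.0 + np.exp(-f.mean(axis=0, keepdims=True))) for f in features
    ]

    # Randomly decide which level's attention map goes to which level.
    perm = rng.permutation(len(features))

    # Multiply each level's features by the attention map assigned to it;
    # the (1, H, W) map broadcasts over the channel axis.
    attended = [f * attentions[j] for f, j in zip(features, perm)]
    return attended, perm


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three hypothetical levels, already resized to a common shape.
    feats = [rng.standard_normal((4, 8, 8)) for _ in range(3)]
    out, perm = shuffled_attention(feats, rng)
    print("permutation:", perm)
    print("output shapes:", [o.shape for o in out])
```

Because every level may receive any other level's map, the features stay useful only if the maps agree on where the salient regions are, which is one way to read the "attention consistency" the abstract refers to.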
