Pyramid Constrained Self-Attention Network for Fast Video Salient Object Detection

Authors

  • Yuchao Gu, Nankai University
  • Lijuan Wang, Nankai University
  • Ziqin Wang, The University of Sydney
  • Yun Liu, Nankai University
  • Ming-Ming Cheng, Nankai University
  • Shao-Ping Lu, Nankai University

DOI:

https://doi.org/10.1609/aaai.v34i07.6718

Abstract

Spatiotemporal information is essential for video salient object detection (VSOD), because object motion strongly attracts human attention. Previous VSOD methods usually use Long Short-Term Memory (LSTM) or 3D ConvNets (C3D), which can only encode motion information through step-by-step propagation in the temporal domain. Recently, the non-local mechanism has been proposed to capture long-range dependencies directly. However, it is not straightforward to apply the non-local mechanism to VSOD, because i) it fails to capture motion cues and tends to learn motion-independent global contexts; ii) its computation and memory costs are prohibitive for video dense prediction tasks such as VSOD. To address these problems, we design a Constrained Self-Attention (CSA) operation that captures motion cues based on the prior that objects always move along a continuous trajectory. We group a set of CSA operations in a pyramid structure (PCSA) to capture objects at various scales and speeds. Extensive experimental results demonstrate that our method outperforms previous state-of-the-art methods in both accuracy and speed (110 FPS on a single Titan Xp) on five challenging datasets. Our code is available at https://github.com/guyuchao/PyramidCSA.
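The core idea of constrained self-attention, as described in the abstract, is to let each query position attend only to keys and values inside a small spatiotemporal neighborhood, exploiting the prior that moving objects follow a continuous trajectory. The following NumPy sketch illustrates that idea; it is an illustrative reconstruction, not the authors' implementation, and the `window` parameter and function name are our own assumptions (the paper's actual operation, including the pyramid grouping, lives in the linked repository).

```python
import numpy as np

def constrained_self_attention(q, k, v, window=1):
    """Illustrative sketch (not the paper's code): each query location
    attends only to keys/values within a (2*window+1)^2 spatial
    neighborhood across all T frames, reflecting the continuous-
    trajectory prior. q, k, v: arrays of shape (T, H, W, C)."""
    T, H, W, C = q.shape
    out = np.zeros_like(v)
    for y in range(H):
        for x in range(W):
            # Constrained window around (y, x), clipped at borders.
            y0, y1 = max(0, y - window), min(H, y + window + 1)
            x0, x1 = max(0, x - window), min(W, x + window + 1)
            kw = k[:, y0:y1, x0:x1].reshape(-1, C)  # (N, C) local keys
            vw = v[:, y0:y1, x0:x1].reshape(-1, C)  # (N, C) local values
            for t in range(T):
                # Scaled dot-product scores over the local window only.
                scores = kw @ q[t, y, x] / np.sqrt(C)
                w = np.exp(scores - scores.max())
                w /= w.sum()  # softmax over the constrained neighborhood
                out[t, y, x] = w @ vw
    return out
```

Because the softmax runs over O(T·window²) positions instead of the full T·H·W map, the cost grows linearly with the window size rather than quadratically with the frame resolution, which is the efficiency argument the abstract makes against the plain non-local mechanism. The paper's PCSA further applies this operation at multiple window sizes in a pyramid to cover objects moving at different scales and speeds.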

Published

2020-04-03

How to Cite

Gu, Y., Wang, L., Wang, Z., Liu, Y., Cheng, M.-M., & Lu, S.-P. (2020). Pyramid Constrained Self-Attention Network for Fast Video Salient Object Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 34(07), 10869-10876. https://doi.org/10.1609/aaai.v34i07.6718

Section

AAAI Technical Track: Vision