Pyramid Constrained Self-Attention Network for Fast Video Salient Object Detection

Yuchao Gu; Lijuan Wang; Ziqin Wang; Yun Liu; Ming-Ming Cheng; Shao-Ping Lu

doi:10.1609/aaai.v34i07.6718

Authors

Yuchao Gu Nankai University
Lijuan Wang Nankai University
Ziqin Wang The University of Sydney
Yun Liu Nankai University
Ming-Ming Cheng Nankai University
Shao-Ping Lu Nankai University

DOI:

https://doi.org/10.1609/aaai.v34i07.6718

Abstract

Spatiotemporal information is essential for video salient object detection (VSOD) because object motion strongly attracts human attention. Previous VSOD methods usually use Long Short-Term Memory (LSTM) or 3D ConvNet (C3D), which can only encode motion information through step-by-step propagation in the temporal domain. Recently, the non-local mechanism was proposed to capture long-range dependencies directly. However, it is not straightforward to apply the non-local mechanism to VSOD, because i) it fails to capture motion cues and tends to learn motion-independent global contexts; and ii) its computation and memory costs are prohibitive for dense video prediction tasks such as VSOD. To address these problems, we design a Constrained Self-Attention (CSA) operation to capture motion cues, based on the prior that objects always move along a continuous trajectory. We group a set of CSA operations in pyramid structures (PCSA) to capture objects at various scales and speeds. Extensive experimental results demonstrate that our method outperforms previous state-of-the-art methods in both accuracy and speed (110 FPS on a single Titan Xp) on five challenging datasets. Our code is available at https://github.com/guyuchao/PyramidCSA.
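
The idea of constraining self-attention to a local neighborhood can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the repository above for that). It assumes that "constrained" means each query location attends only to a small, optionally dilated window around the same spatial position in every frame, and that the pyramid simply concatenates the outputs of several such windows. The function names `constrained_self_attention` and `pyramid_csa`, the `radius` and `dilation` parameters, and the use of raw features as queries, keys, and values are all hypothetical choices made for illustration.

```python
import numpy as np

def constrained_self_attention(feats, radius=1, dilation=1):
    """Illustrative local attention over a (T, H, W, C) float array.

    Assumption: each query at (t, y, x) attends only to positions inside a
    (2*radius+1)^2 window, spaced by `dilation`, centred on (y, x) in all T frames.
    """
    T, H, W, C = feats.shape
    pad = radius * dilation
    offsets = [(dy * dilation, dx * dilation)
               for dy in range(-radius, radius + 1)
               for dx in range(-radius, radius + 1)]
    padded = np.pad(feats, ((0, 0), (pad, pad), (pad, pad), (0, 0)))
    valid = np.pad(np.ones((H, W), dtype=bool), pad)  # False on the zero padding

    # Gather the window around every location: keys has shape (T, H, W, K, C).
    keys = np.stack([padded[:, pad + dy:pad + dy + H, pad + dx:pad + dx + W]
                     for dy, dx in offsets], axis=3)
    # mask[y, x, k] is True when offset k stays inside the frame.
    mask = np.stack([valid[pad + dy:pad + dy + H, pad + dx:pad + dx + W]
                     for dy, dx in offsets], axis=-1)
    values = keys.transpose(1, 2, 0, 3, 4).reshape(H, W, -1, C)

    out = np.empty_like(feats)
    for t in range(T):
        # Similarity of each query to its window in every frame: (H, W, T, K).
        scores = np.einsum('hwc,thwkc->hwtk', feats[t], keys) / np.sqrt(C)
        scores = np.where(mask[:, :, None, :], scores, -np.inf)
        scores = scores.reshape(H, W, -1)
        scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        out[t] = np.einsum('hwn,hwnc->hwc', weights, values)
    return out

def pyramid_csa(feats, dilations=(1, 2)):
    """Assumed pyramid: run the window at several dilations, concatenate channels."""
    return np.concatenate(
        [constrained_self_attention(feats, radius=1, dilation=d) for d in dilations],
        axis=-1)
```

Because every query looks at only K window positions per frame rather than all H x W positions, the cost in this sketch grows with T x K per query instead of T x H x W, which is the kind of saving the abstract attributes to constraining the attention. Larger dilations widen the window without adding positions, which is one plausible way to cover faster motion.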
