SCD-Net: Spatiotemporal Clues Disentanglement Network for Self-Supervised Skeleton-Based Action Recognition

Authors

  • Cong Wu (Jiangnan University; University of Surrey)
  • Xiao-Jun Wu (Jiangnan University)
  • Josef Kittler (University of Surrey)
  • Tianyang Xu (Jiangnan University)
  • Sara Ahmed (University of Surrey)
  • Muhammad Awais (University of Surrey)
  • Zhenhua Feng (University of Surrey)

DOI:

https://doi.org/10.1609/aaai.v38i6.28409

Keywords:

CV: Video Understanding & Activity Analysis, ML: Unsupervised & Self-Supervised Learning

Abstract

Contrastive learning has achieved great success in skeleton-based action recognition. However, most existing approaches encode skeleton sequences as entangled spatiotemporal representations and confine the contrasts to the same level of representation. To address this, this paper introduces a novel contrastive learning framework, the Spatiotemporal Clues Disentanglement Network (SCD-Net). Specifically, we integrate a decoupling module with a feature extractor to derive explicit clues from the spatial and temporal domains, respectively. To train SCD-Net, we construct a global anchor and encourage interaction between this anchor and the extracted clues. Further, we propose a new masking strategy with structural constraints to strengthen contextual associations, bringing recent developments in masked image modelling into SCD-Net. We conduct extensive evaluations on the NTU-RGB+D (60 & 120) and PKU-MMD (I & II) datasets, covering various downstream tasks such as action recognition, action retrieval, transfer learning, and semi-supervised learning. The experimental results demonstrate the effectiveness of our method, which significantly outperforms existing state-of-the-art (SOTA) approaches. Our code and supplementary material can be found at https://github.com/cong-wu/SCD-Net.
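
The abstract describes the interaction between the global anchor and the extracted clues only at a high level and does not name the loss. As a rough illustration, the sketch below assumes an InfoNCE-style contrastive objective, a common choice in contrastive learning, applied between a batch of anchor embeddings and batches of spatial and temporal clue embeddings. The function names (`info_nce`, `anchor_clue_loss`), the temperature value, and the synthetic embeddings are all hypothetical and are not taken from the paper or its repository.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def info_nce(anchor, clue, temperature=0.1):
    # Assumed InfoNCE-style loss: row i of `anchor` is the positive for
    # row i of `clue`; every other row in the batch acts as a negative.
    a = l2_normalize(anchor)
    c = l2_normalize(clue)
    logits = a @ c.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def anchor_clue_loss(anchor, spatial, temporal, temperature=0.1):
    # Hypothetical combination: contrast the anchor with each clue type
    # separately and sum the two terms.
    return (info_nce(anchor, spatial, temperature)
            + info_nce(anchor, temporal, temperature))

# Synthetic embeddings standing in for encoder outputs (illustration only).
rng = np.random.default_rng(0)
batch, dim = 8, 16
anchor = rng.normal(size=(batch, dim))
spatial = anchor + 0.05 * rng.normal(size=(batch, dim))
temporal = anchor + 0.05 * rng.normal(size=(batch, dim))

aligned = anchor_clue_loss(anchor, spatial, temporal)
# Shifting the rows pairs each anchor with the wrong sample's clues.
misaligned = anchor_clue_loss(anchor,
                              np.roll(spatial, 1, axis=0),
                              np.roll(temporal, 1, axis=0))
print(f"aligned loss: {aligned:.4f}, misaligned loss: {misaligned:.4f}")
```

In this toy setup the loss is lower when each anchor is paired with clues from the same sample than when the pairing is shifted, which is the behaviour a contrastive objective of this kind is meant to reward.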

Published

2024-03-24

How to Cite

Wu, C., Wu, X.-J., Kittler, J., Xu, T., Ahmed, S., Awais, M., & Feng, Z. (2024). SCD-Net: Spatiotemporal Clues Disentanglement Network for Self-Supervised Skeleton-Based Action Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 38(6), 5949–5957. https://doi.org/10.1609/aaai.v38i6.28409

Section

AAAI Technical Track on Computer Vision V