SpatioTemporal Learning for Human Pose Estimation in Sparsely-Labeled Videos

Authors

  • Yingying Jiao — College of Computer Science and Technology, Jilin University; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University
  • Zhigang Wang — College of Computer Science and Technology, Zhejiang Gongshang University
  • Sifan Wu — College of Computer Science and Technology, Jilin University; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University
  • Shaojing Fan — School of Computing, National University of Singapore
  • Zhenguang Liu — The State Key Laboratory of Blockchain and Data Security, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
  • Zhuoyue Xu — College of Computer Science and Technology, Zhejiang Gongshang University
  • Zheqi Wu — College of Computer Science and Technology, Zhejiang Gongshang University

DOI:

https://doi.org/10.1609/aaai.v39i4.32429

Abstract

Human pose estimation in videos remains challenging, largely because it relies on extensive manual annotation of large datasets, which is expensive and labor-intensive. Furthermore, existing approaches often struggle to capture long-range temporal dependencies and overlook the complementary relationship between temporal pose heatmaps and visual features. To address these limitations, we introduce STDPose, a novel framework that enhances human pose estimation by learning spatiotemporal dynamics in sparsely-labeled videos. STDPose incorporates two key innovations: 1) a novel Dynamic-Aware Mask that captures long-range motion context, allowing for a nuanced understanding of pose changes; and 2) a system for encoding and aggregating spatiotemporal representations and motion dynamics, which effectively models spatiotemporal relationships and improves the accuracy and robustness of pose estimation. STDPose establishes a new performance benchmark for both video pose propagation (i.e., propagating pose annotations from labeled frames to unlabeled frames) and pose estimation across three large-scale evaluation datasets. Additionally, using pseudo-labels generated by pose propagation, STDPose achieves competitive performance with only 26.7% of the labeled data.
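To make the two ideas in the abstract concrete, the following is a speculative NumPy sketch, not the authors' implementation: a mask derived from how pose heatmaps change across a frame window (standing in for the Dynamic-Aware Mask), used to weight the fusion of visual features with heatmap cues. All function names, shapes, and the thresholding scheme are invented for illustration.

```python
import numpy as np

def dynamic_aware_mask(heatmaps, threshold=0.1):
    """Hypothetical motion mask: flag spatial locations whose pose
    heatmaps change notably across the frame window, as a crude proxy
    for long-range motion context. `heatmaps` has shape (T, J, H, W):
    T frames, J joints, H x W spatial resolution."""
    # Largest per-location change relative to the first frame,
    # aggregated over frames and joints -> shape (H, W).
    diffs = np.abs(heatmaps - heatmaps[0:1]).max(axis=(0, 1))
    return (diffs > threshold).astype(np.float32)

def fuse(features, heatmaps, mask):
    """Hypothetical aggregation: combine visual features (C, H, W)
    with a temporal pose cue, amplifying regions the mask marks as
    dynamic. The mask (H, W) broadcasts over the channel axis."""
    pose_cue = heatmaps.mean(axis=(0, 1))        # temporal pose prior, (H, W)
    return features * (1.0 + mask * pose_cue)    # emphasize moving joints
```

Usage on toy data: with a heatmap stack that changes only at one pixel across frames, `dynamic_aware_mask` returns a mask that is 1 at that pixel and 0 elsewhere, and `fuse` boosts the feature response only there. The actual STDPose modules are learned networks; this sketch only mirrors the stated data flow (motion context gating the fusion of heatmaps and visual features).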

Published

2025-04-11

How to Cite

Jiao, Y., Wang, Z., Wu, S., Fan, S., Liu, Z., Xu, Z., & Wu, Z. (2025). SpatioTemporal Learning for Human Pose Estimation in Sparsely-Labeled Videos. Proceedings of the AAAI Conference on Artificial Intelligence, 39(4), 4093-4101. https://doi.org/10.1609/aaai.v39i4.32429

Section

AAAI Technical Track on Computer Vision III