SpatioTemporal Learning for Human Pose Estimation in Sparsely-Labeled Videos

Authors

  • Yingying Jiao — College of Computer Science and Technology, Jilin University; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University
  • Zhigang Wang — College of Computer Science and Technology, Zhejiang Gongshang University
  • Sifan Wu — College of Computer Science and Technology, Jilin University; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University
  • Shaojing Fan — School of Computing, National University of Singapore
  • Zhenguang Liu — The State Key Laboratory of Blockchain and Data Security, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
  • Zhuoyue Xu — College of Computer Science and Technology, Zhejiang Gongshang University
  • Zheqi Wu — College of Computer Science and Technology, Zhejiang Gongshang University

DOI:

https://doi.org/10.1609/aaai.v39i4.32429

Abstract

Human pose estimation in videos remains challenging, largely because it relies on extensive manual annotation of large datasets, which is expensive and labor-intensive. Furthermore, existing approaches often struggle to capture long-range temporal dependencies and overlook the complementary relationship between temporal pose heatmaps and visual features. To address these limitations, we introduce STDPose, a novel framework that enhances human pose estimation by learning spatiotemporal dynamics in sparsely-labeled videos. STDPose incorporates two key innovations: 1) a novel Dynamic-Aware Mask that captures long-range motion context, allowing for a nuanced understanding of pose changes; and 2) a system for encoding and aggregating spatiotemporal representations and motion dynamics, which effectively models spatiotemporal relationships and improves the accuracy and robustness of pose estimation. STDPose establishes a new performance benchmark for both video pose propagation (i.e., propagating pose annotations from labeled frames to unlabeled frames) and pose estimation across three large-scale evaluation datasets. Additionally, using pseudo-labels generated by pose propagation, STDPose achieves competitive performance with only 26.7% of the labeled data.
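To make the two ideas in the abstract concrete, the following is a speculative NumPy sketch, not the authors' implementation: a mask derived from how pose heatmaps change across a frame window (standing in for the Dynamic-Aware Mask), used to weight the fusion of visual features with heatmap cues. All function names, shapes, and the thresholding scheme are invented for illustration.

```python
import numpy as np

def dynamic_aware_mask(heatmaps, threshold=0.1):
    """Hypothetical motion mask: flag spatial locations whose pose
    heatmaps change notably across the frame window, as a crude proxy
    for long-range motion context. `heatmaps` has shape (T, J, H, W):
    T frames, J joints, H x W spatial resolution."""
    # Largest per-location change relative to the first frame,
    # aggregated over frames and joints -> shape (H, W).
    diffs = np.abs(heatmaps - heatmaps[0:1]).max(axis=(0, 1))
    return (diffs > threshold).astype(np.float32)

def fuse(features, heatmaps, mask):
    """Hypothetical aggregation: combine visual features (C, H, W)
    with a temporal pose cue, amplifying regions the mask marks as
    dynamic. The mask (H, W) broadcasts over the channel axis."""
    pose_cue = heatmaps.mean(axis=(0, 1))        # temporal pose prior, (H, W)
    return features * (1.0 + mask * pose_cue)    # emphasize moving joints
```

Usage on toy data: with a heatmap stack that changes only at one pixel across frames, `dynamic_aware_mask` returns a mask that is 1 at that pixel and 0 elsewhere, and `fuse` boosts the feature response only there. The actual STDPose modules are learned networks; this sketch only mirrors the stated data flow (motion context gating the fusion of heatmaps and visual features).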

Published

2025-04-11

How to Cite

Jiao, Y., Wang, Z., Wu, S., Fan, S., Liu, Z., Xu, Z., & Wu, Z. (2025). SpatioTemporal Learning for Human Pose Estimation in Sparsely-Labeled Videos. Proceedings of the AAAI Conference on Artificial Intelligence, 39(4), 4093-4101. https://doi.org/10.1609/aaai.v39i4.32429

Section

AAAI Technical Track on Computer Vision III