DiffusionPose: Markov-Optimized Diffusion Model for Human Pose Estimation

Authors

  • Zhigang Wang, The State Key Laboratory of Blockchain and Data Security, Zhejiang University
  • Zhenguang Liu, The State Key Laboratory of Blockchain and Data Security, Zhejiang University; Shandong Rendui Network Co., Ltd.; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
  • Shaojing Fan, Department of Electrical and Computer Engineering, National University of Singapore
  • Sifan Wu, College of Computer Science and Technology, Jilin University; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University
  • Yingying Jiao, College of Computer Science and Technology, Zhejiang University of Technology

DOI:

https://doi.org/10.1609/aaai.v40i12.38012

Abstract

Video-based human pose estimation has long been a nontrivial task due to its dynamic nature and challenging detection scenarios such as occlusion and defocus. Inspired by the success of diffusion models, researchers have applied them to video pose estimation, outperforming traditional joint detection methods. However, existing diffusion model-based methods still face challenges such as slow convergence and unstable pose generation. To tackle these issues, we propose DiffusionPose, a novel framework for video pose estimation that integrates diffusion models with three optimization strategies: (1) We combine the emerging Mamba with Transformers to balance global and local spatio-temporal modeling. (2) We integrate Markov Random Fields into the reverse diffusion process to enhance the denoising of pose heatmaps, particularly addressing the ambiguous generation of occluded joints. (3) We mathematically formulate a Markov objective to supervise the heatmap denoising process, enabling the model to generate anatomically plausible skeletons. Our method achieves state-of-the-art performance on three large-scale benchmark datasets. It is also notably robust in challenging video scenarios, improving the accuracy on the ankle, the most difficult joint, by 16.9% compared to the previous best diffusion model-based method on the Challenging-PoseTrack dataset.

Published

2026-03-14

How to Cite

Wang, Z., Liu, Z., Fan, S., Wu, S., & Jiao, Y. (2026). DiffusionPose: Markov-Optimized Diffusion Model for Human Pose Estimation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(12), 10412–10420. https://doi.org/10.1609/aaai.v40i12.38012

Issue

Vol. 40 No. 12 (2026)

Section

AAAI Technical Track on Computer Vision IX