VPHO: Joint Visual-Physical Cue Learning and Aggregation for Hand-Object Pose Estimation

Authors

  • Jun Zhou: School of Automation, China University of Geosciences, Wuhan 430074, China; Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan 430074, China; Engineering Research Center of Intelligent Technology for Geo-Exploration, Ministry of Education, Wuhan 430074, China; Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6G 2R3, Canada
  • Chi Xu: School of Automation, China University of Geosciences, Wuhan 430074, China; Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan 430074, China; Engineering Research Center of Intelligent Technology for Geo-Exploration, Ministry of Education, Wuhan 430074, China
  • Kaifeng Tang: School of Automation, China University of Geosciences, Wuhan 430074, China; Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan 430074, China; Engineering Research Center of Intelligent Technology for Geo-Exploration, Ministry of Education, Wuhan 430074, China
  • Yuting Ge: School of Automation, China University of Geosciences, Wuhan 430074, China; Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan 430074, China; Engineering Research Center of Intelligent Technology for Geo-Exploration, Ministry of Education, Wuhan 430074, China
  • Tingrui Guo: School of Automation, China University of Geosciences, Wuhan 430074, China; Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan 430074, China; Engineering Research Center of Intelligent Technology for Geo-Exploration, Ministry of Education, Wuhan 430074, China
  • Li Cheng: Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6G 2R3, Canada

DOI:

https://doi.org/10.1609/aaai.v40i16.38375

Abstract

Estimating the 3D poses of hands and objects from a single RGB image is a fundamental yet challenging problem, with broad applications in augmented reality and human-computer interaction. Existing methods largely rely on visual cues alone, often producing physically implausible results that exhibit artifacts such as hand-object interpenetration or lack of contact. Recent efforts to incorporate physics reasoning typically depend on post-optimization or non-differentiable physics engines, which compromise visual consistency and end-to-end trainability. To overcome these limitations, we propose a novel framework that jointly integrates visual and physical cues for hand-object pose estimation. This integration is achieved through two key ideas: 1) joint visual-physical cue learning: the model is trained to extract 2D visual cues and 3D physical cues, thereby enabling more comprehensive representation learning for hand-object interactions; 2) candidate pose aggregation: a refinement process that aggregates multiple diffusion-generated candidate poses by leveraging both visual and physical predictions, yielding a final estimate that is visually consistent and physically plausible. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches in both pose accuracy and physical plausibility.
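The candidate pose aggregation described above can be illustrated with a minimal score-weighted average: each diffusion-generated candidate is scored on visual consistency and physical plausibility, and the scores determine the candidates' contribution to the final pose. The function name, the softmax weighting, and the `alpha`/`temperature` parameters below are illustrative assumptions for the sketch, not the paper's actual formulation.

```python
import numpy as np

def aggregate_candidates(candidates, visual_scores, physical_scores,
                         alpha=0.5, temperature=0.1):
    """Fuse candidate poses into one estimate (illustrative sketch only).

    candidates:      (N, D) array of candidate pose parameters
    visual_scores:   (N,) higher = better agreement with 2D visual cues
    physical_scores: (N,) higher = more physically plausible
    """
    visual = np.asarray(visual_scores, dtype=float)
    physical = np.asarray(physical_scores, dtype=float)
    # Blend the two cue types into one per-candidate score
    combined = alpha * visual + (1.0 - alpha) * physical
    # Softmax over scores -> aggregation weights (temperature controls
    # how sharply the best-scoring candidates dominate)
    logits = combined / temperature
    logits -= logits.max()  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum()
    # Weighted average of the candidate poses
    return weights @ np.asarray(candidates, dtype=float)
```

With a low temperature the average concentrates on the candidates that satisfy both cue types, which is the intuition behind combining visual and physical predictions rather than scoring with either alone.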

Published

2026-03-14

How to Cite

Zhou, J., Xu, C., Tang, K., Ge, Y., Guo, T., & Cheng, L. (2026). VPHO: Joint Visual-Physical Cue Learning and Aggregation for Hand-Object Pose Estimation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13683–13691. https://doi.org/10.1609/aaai.v40i16.38375

Section

AAAI Technical Track on Computer Vision XIII