VPHO: Joint Visual-Physical Cue Learning and Aggregation for Hand-Object Pose Estimation

Authors

  • Jun Zhou: School of Automation, China University of Geosciences, Wuhan 430074, China; Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan 430074, China; Engineering Research Center of Intelligent Technology for Geo-Exploration, Ministry of Education, Wuhan 430074, China; Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6G 2R3, Canada
  • Chi Xu: School of Automation, China University of Geosciences, Wuhan 430074, China; Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan 430074, China; Engineering Research Center of Intelligent Technology for Geo-Exploration, Ministry of Education, Wuhan 430074, China
  • Kaifeng Tang: School of Automation, China University of Geosciences, Wuhan 430074, China; Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan 430074, China; Engineering Research Center of Intelligent Technology for Geo-Exploration, Ministry of Education, Wuhan 430074, China
  • Yuting Ge: School of Automation, China University of Geosciences, Wuhan 430074, China; Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan 430074, China; Engineering Research Center of Intelligent Technology for Geo-Exploration, Ministry of Education, Wuhan 430074, China
  • Tingrui Guo: School of Automation, China University of Geosciences, Wuhan 430074, China; Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan 430074, China; Engineering Research Center of Intelligent Technology for Geo-Exploration, Ministry of Education, Wuhan 430074, China
  • Li Cheng: Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6G 2R3, Canada

DOI:

https://doi.org/10.1609/aaai.v40i16.38375

Abstract

Estimating the 3D poses of hands and objects from a single RGB image is a fundamental yet challenging problem, with broad applications in augmented reality and human-computer interaction. Existing methods largely rely on visual cues alone, often producing physically implausible results that exhibit artifacts such as hand-object interpenetration or lack of contact. Recent efforts to incorporate physics reasoning typically depend on post-optimization or non-differentiable physics engines, which compromise visual consistency and end-to-end trainability. To overcome these limitations, we propose a novel framework that jointly integrates visual and physical cues for hand-object pose estimation. This integration is achieved through two key ideas: 1) joint visual-physical cue learning: the model is trained to extract 2D visual cues and 3D physical cues, thereby enabling more comprehensive representation learning for hand-object interactions; 2) candidate pose aggregation: a refinement process that aggregates multiple diffusion-generated candidate poses by leveraging both visual and physical predictions, yielding a final estimate that is visually consistent and physically plausible. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches in both pose accuracy and physical plausibility.
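The candidate pose aggregation described above can be illustrated with a minimal score-weighted average: each diffusion-generated candidate is scored on visual consistency and physical plausibility, and the scores determine the candidates' contribution to the final pose. The function name, the softmax weighting, and the `alpha`/`temperature` parameters below are illustrative assumptions for the sketch, not the paper's actual formulation.

```python
import numpy as np

def aggregate_candidates(candidates, visual_scores, physical_scores,
                         alpha=0.5, temperature=0.1):
    """Fuse candidate poses into one estimate (illustrative sketch only).

    candidates:      (N, D) array of candidate pose parameters
    visual_scores:   (N,) higher = better agreement with 2D visual cues
    physical_scores: (N,) higher = more physically plausible
    """
    visual = np.asarray(visual_scores, dtype=float)
    physical = np.asarray(physical_scores, dtype=float)
    # Blend the two cue types into one per-candidate score
    combined = alpha * visual + (1.0 - alpha) * physical
    # Softmax over scores -> aggregation weights (temperature controls
    # how sharply the best-scoring candidates dominate)
    logits = combined / temperature
    logits -= logits.max()  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum()
    # Weighted average of the candidate poses
    return weights @ np.asarray(candidates, dtype=float)
```

With a low temperature the average concentrates on the candidates that satisfy both cue types, which is the intuition behind combining visual and physical predictions rather than scoring with either alone.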

Published

2026-03-14

How to Cite

Zhou, J., Xu, C., Tang, K., Ge, Y., Guo, T., & Cheng, L. (2026). VPHO: Joint Visual-Physical Cue Learning and Aggregation for Hand-Object Pose Estimation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13683–13691. https://doi.org/10.1609/aaai.v40i16.38375

Section

AAAI Technical Track on Computer Vision XIII