Keypoint Fusion for RGB-D Based 3D Hand Pose Estimation

Xingyu Liu; Pengfei Ren; Yuanyuan Gao; Jingyu Wang; Haifeng Sun; Qi Qi; Zirui Zhuang; Jianxin Liao

doi:10.1609/aaai.v38i4.28166

Authors

Xingyu Liu, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
Pengfei Ren, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
Yuanyuan Gao, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
Jingyu Wang, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
Haifeng Sun, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
Qi Qi, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
Zirui Zhuang, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
Jianxin Liao, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications

Keywords:

CV: Biometrics, Face, Gesture & Pose, CV: Multi-modal Vision

Abstract

Previous 3D hand pose estimation methods rely primarily on a single modality, either RGB or depth, and the combined use of both modalities has not been extensively explored. RGB and depth data provide complementary information and can therefore be fused to make 3D hand pose estimation more robust. However, two problems arise when existing fusion methods are applied to 3D hand pose estimation: the redundancy of dense feature fusion and the ambiguity of visual features. First, pixel-wise feature interactions incur high computational cost and waste computation on invalid pixels. Second, visual features are ambiguous because of color and texture similarities, as well as depth holes and noise caused by frequent hand movements, and this ambiguity interferes with modeling cross-modal correlations. In this paper, we propose Keypoint-Fusion for RGB-D based 3D hand pose estimation, which leverages the distinct advantages of the two modalities to mutually eliminate feature ambiguity and performs cross-modal feature fusion more efficiently. Specifically, we focus cross-modal fusion on sparse yet informative spatial regions (i.e., keypoints). Meanwhile, by explicitly extracting the more reliable information from each modality as disambiguation evidence, the depth modality provides 3D geometric information for RGB feature pixels, and the RGB modality restores the precise edge information lost to depth noise. Keypoint-Fusion achieves state-of-the-art performance on two challenging hand datasets, significantly reducing the error compared with previous single-modal methods.
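
To illustrate why restricting fusion to keypoints is cheaper than dense pixel-wise fusion, the following is a minimal sketch and not the authors' implementation. It assumes per-modality feature maps of shape [C][H][W], integer keypoint locations, nearest-pixel sampling, and fusion by plain concatenation; the function names, the 21-keypoint count, and the map sizes are all hypothetical choices made for this example.

```python
import random

def sample_at_keypoints(feat, kps):
    """Gather one C-dimensional feature vector per keypoint.

    feat: nested list indexed as [channel][row][col] (assumed layout).
    kps:  list of (x, y) integer pixel coordinates.
    """
    h, w = len(feat[0]), len(feat[0][0])
    out = []
    for x, y in kps:
        # Clamp to the map so out-of-range keypoints stay valid indices.
        x = min(max(x, 0), w - 1)
        y = min(max(y, 0), h - 1)
        out.append([channel[y][x] for channel in feat])
    return out

def fuse_at_keypoints(rgb_feat, depth_feat, kps):
    """Concatenate RGB and depth features at each keypoint (illustrative fusion)."""
    rgb_k = sample_at_keypoints(rgb_feat, kps)
    dep_k = sample_at_keypoints(depth_feat, kps)
    return [r + d for r, d in zip(rgb_k, dep_k)]

def random_map(rng, c, h, w):
    """Stand-in for a backbone feature map; values are random placeholders."""
    return [[[rng.random() for _ in range(w)] for _ in range(h)] for _ in range(c)]

rng = random.Random(0)
C, H, W, K = 8, 32, 32, 21  # hypothetical sizes; K = 21 is an assumed keypoint count
rgb_feat = random_map(rng, C, H, W)
depth_feat = random_map(rng, C, H, W)
kps = [(rng.randrange(W), rng.randrange(H)) for _ in range(K)]

fused = fuse_at_keypoints(rgb_feat, depth_feat, kps)
print(len(fused), len(fused[0]))   # 21 16
print(K * K, (H * W) ** 2)         # pairwise interactions: 441 vs 1048576
```

Under these assumptions, any all-pairs interaction between the two modalities involves K × K keypoint pairs rather than (H × W)² pixel pairs, which is the kind of saving the abstract attributes to sparse fusion.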
