Keypoint Fusion for RGB-D Based 3D Hand Pose Estimation

Authors

  • Xingyu Liu, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
  • Pengfei Ren, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
  • Yuanyuan Gao, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
  • Jingyu Wang, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
  • Haifeng Sun, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
  • Qi Qi, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
  • Zirui Zhuang, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
  • Jianxin Liao, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications

DOI:

https://doi.org/10.1609/aaai.v38i4.28166

Keywords:

CV: Biometrics, Face, Gesture & Pose, CV: Multi-modal Vision

Abstract

Previous 3D hand pose estimation methods rely primarily on a single modality, either RGB or depth, and the joint use of both modalities has not been extensively explored. RGB and depth data provide complementary information and can therefore be fused to make 3D hand pose estimation more robust. However, applying existing fusion methods to 3D hand pose estimation raises two problems: redundancy of dense feature fusion and ambiguity of visual features. First, pixel-wise feature interactions incur high computational cost and waste computation on invalid pixels. Second, visual features are ambiguous because of color and texture similarities, as well as depth holes and noise caused by frequent hand movements, and this ambiguity interferes with modeling cross-modal correlations. In this paper, we propose Keypoint-Fusion for RGB-D based 3D hand pose estimation, which leverages the distinct advantages of the two modalities to mutually eliminate feature ambiguity and performs cross-modal feature fusion more efficiently. Specifically, we focus cross-modal fusion on sparse yet informative spatial regions (i.e., keypoints). Meanwhile, each modality explicitly extracts its relatively more reliable information as disambiguation evidence: the depth modality provides 3D geometric information for RGB feature pixels, and the RGB modality supplies the precise edge information lost to depth noise. Keypoint-Fusion achieves state-of-the-art performance on two challenging hand datasets, significantly decreasing the error compared with previous single-modal methods.
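
To make the redundancy argument concrete, the following minimal sketch compares an all-to-all pixel-wise interaction with one restricted to keypoints. It is illustrative only and is not the paper's implementation: the feature-map shape, the count of 21 keypoints, the keypoint locations, the concatenation-based fusion, and every function name are assumptions chosen for the example.

```python
# Illustrative sketch only. The feature-map size, keypoint count, and the
# fusion rule (plain concatenation) are assumptions, not details from the paper.

def sample_at_keypoints(feature_map, keypoints):
    """Pick the feature vector stored at each (row, col) keypoint location."""
    return [feature_map[r][c] for r, c in keypoints]


def fuse_keypoint_features(rgb_map, depth_map, keypoints):
    """Concatenate RGB and depth features at the keypoint locations only."""
    rgb_feats = sample_at_keypoints(rgb_map, keypoints)
    depth_feats = sample_at_keypoints(depth_map, keypoints)
    return [r + d for r, d in zip(rgb_feats, depth_feats)]


def pairwise_interactions(num_rgb_tokens, num_depth_tokens):
    """Count the cross-modal token pairs an all-to-all interaction visits."""
    return num_rgb_tokens * num_depth_tokens


H, W, C = 64, 64, 4      # hypothetical feature-map height, width, channels
NUM_KEYPOINTS = 21       # a common hand-joint count, assumed here

# Dummy feature maps: every RGB feature is all 1.0, every depth feature all 2.0.
rgb_map = [[[1.0] * C for _ in range(W)] for _ in range(H)]
depth_map = [[[2.0] * C for _ in range(W)] for _ in range(H)]

# Hypothetical keypoint locations along the diagonal (largest index is 60).
keypoints = [(i * 3, i * 3) for i in range(NUM_KEYPOINTS)]

fused = fuse_keypoint_features(rgb_map, depth_map, keypoints)
dense_pairs = pairwise_interactions(H * W, H * W)
sparse_pairs = pairwise_interactions(NUM_KEYPOINTS, NUM_KEYPOINTS)

print(len(fused), len(fused[0]))   # one fused vector of length 2*C per keypoint
print(dense_pairs, sparse_pairs)   # dense vs keypoint-only interaction counts
```

Under these assumed sizes, the dense interaction visits 4096 × 4096 pixel pairs while the keypoint-only variant visits 21 × 21, which is the kind of saving the sparse formulation targets.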

Published

2024-03-24

How to Cite

Liu, X., Ren, P., Gao, Y., Wang, J., Sun, H., Qi, Q., Zhuang, Z., & Liao, J. (2024). Keypoint Fusion for RGB-D Based 3D Hand Pose Estimation. Proceedings of the AAAI Conference on Artificial Intelligence, 38(4), 3756-3764. https://doi.org/10.1609/aaai.v38i4.28166

Issue

Section

AAAI Technical Track on Computer Vision III