Deep Semantic Graph Transformer for Multi-View 3D Human Pose Estimation

Authors

  • Lijun Zhang (Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences; Chongqing School, University of Chinese Academy of Sciences)
  • Kangkang Zhou (Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences; Chongqing School, University of Chinese Academy of Sciences)
  • Feng Lu (Tsinghua Shenzhen International Graduate School, Tsinghua University; Peng Cheng Laboratory)
  • Xiang-Dong Zhou (Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences; Chongqing School, University of Chinese Academy of Sciences)
  • Yu Shi (Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences; Chongqing School, University of Chinese Academy of Sciences)

DOI

https://doi.org/10.1609/aaai.v38i7.28549

Keywords

CV: Biometrics, Face, Gesture & Pose, CV: Applications

Abstract

Most Graph Convolutional Network (GCN)-based 3D human pose estimation (HPE) methods address single-view 3D HPE with fixed spatial graphs, and thus suffer from key problems such as depth ambiguity, insufficient feature representation, and limited receptive fields. To address these issues, we propose a multi-view 3D HPE framework based on a deep semantic graph transformer, which adaptively learns and fuses significant multi-view semantic features of human joints to improve 3D HPE performance. First, we propose a deep semantic graph transformer encoder to enrich spatial feature information. It deeply mines the position, spatial structure, and skeletal edge knowledge of joints and dynamically learns their correlations. Then, we build a progressive multi-view spatial-temporal feature fusion framework to mitigate joint depth uncertainty. To enhance the spatial pose representation, deep spatial semantic features are exchanged and fused across viewpoints during monocular feature extraction. Furthermore, long-range temporal dependencies are modeled, and spatial-temporal information from all viewpoints is fused to provide intermediate supervision of joint depth. Extensive experiments on three 3D HPE benchmarks show that our method achieves state-of-the-art results. It effectively enhances pose features, mitigates the depth ambiguity of single-view 3D HPE, and improves 3D HPE performance without requiring camera parameters. Code and models are available at https://github.com/z0911k/SGraFormer.
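
The abstract leaves the encoder's internals to the paper; purely as an illustration, the sketch below shows one way a transformer layer can inject skeletal-edge knowledge into joint-to-joint attention. It is a minimal PyTorch sketch under our own assumptions (hypothetical class names, a placeholder 17-joint adjacency), not the authors' SGraFormer implementation, which additionally embeds joint position and spatial-structure cues and fuses features across views; see the linked repository for the real code.

```python
# Minimal sketch only, NOT the authors' SGraFormer code. Class names, shapes,
# and the adjacency-bias mechanism are assumptions made for exposition.
import torch
import torch.nn as nn


class SemanticGraphAttention(nn.Module):
    """Joint-to-joint self-attention biased by skeletal adjacency."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (batch, joints, dim); adj: (joints, joints) with 1 where two
        # joints share a bone (self-loops included), 0 elsewhere.
        # The adjacency becomes an additive attention bias: non-adjacent
        # pairs receive a large negative score, so each joint attends
        # mainly to its skeletal neighbours.
        bias = (adj == 0).float() * -1e4
        out, _ = self.attn(x, x, x, attn_mask=bias)
        return out


class SemanticGraphTransformerLayer(nn.Module):
    """One pre-norm encoder layer: graph-biased attention + feed-forward."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = SemanticGraphAttention(dim, num_heads)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.norm1(x), adj)
        x = x + self.mlp(self.norm2(x))
        return x


if __name__ == "__main__":
    B, J, D = 2, 17, 64           # batch, joints, feature channels
    x = torch.randn(B, J, D)      # per-joint features from one camera view
    adj = torch.eye(J)            # placeholder adjacency; a real one encodes bones
    layer = SemanticGraphTransformerLayer(D)
    print(layer(x, adj).shape)    # torch.Size([2, 17, 64])
```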

Published

2024-03-24

How to Cite

Zhang, L., Zhou, K., Lu, F., Zhou, X.-D., & Shi, Y. (2024). Deep Semantic Graph Transformer for Multi-View 3D Human Pose Estimation. Proceedings of the AAAI Conference on Artificial Intelligence, 38(7), 7205-7214. https://doi.org/10.1609/aaai.v38i7.28549

Issue

Vol. 38 No. 7 (2024)

Section

AAAI Technical Track on Computer Vision VI