Deep Semantic Graph Transformer for Multi-View 3D Human Pose Estimation

Authors

  • Lijun Zhang (Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences; Chongqing School, University of Chinese Academy of Sciences)
  • Kangkang Zhou (Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences; Chongqing School, University of Chinese Academy of Sciences)
  • Feng Lu (Tsinghua Shenzhen International Graduate School, Tsinghua University; Peng Cheng Laboratory)
  • Xiang-Dong Zhou (Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences; Chongqing School, University of Chinese Academy of Sciences)
  • Yu Shi (Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences; Chongqing School, University of Chinese Academy of Sciences)

DOI

https://doi.org/10.1609/aaai.v38i7.28549

Keywords

CV: Biometrics, Face, Gesture & Pose, CV: Applications

Abstract

Most Graph Convolutional Network (GCN)-based 3D human pose estimation (HPE) methods address single-view 3D HPE with fixed spatial graphs, and thus suffer from key problems such as depth ambiguity, insufficient feature representation, and limited receptive fields. To address these issues, we propose a multi-view 3D HPE framework based on a deep semantic graph transformer, which adaptively learns and fuses significant multi-view semantic features of human joints to improve 3D HPE performance. First, we propose a deep semantic graph transformer encoder to enrich spatial feature information. It deeply mines the position, spatial structure, and skeletal edge knowledge of joints and dynamically learns their correlations. Then, we build a progressive multi-view spatial-temporal feature fusion framework to mitigate joint depth uncertainty. To enhance the spatial pose representation, deep spatial semantic features are exchanged and fused across viewpoints during monocular feature extraction. Furthermore, long-range temporal dependencies are modeled, and spatial-temporal information from all viewpoints is fused to provide intermediate supervision of joint depth. Extensive experiments on three 3D HPE benchmarks show that our method achieves state-of-the-art results. It effectively enhances pose features, mitigates the depth ambiguity of single-view 3D HPE, and improves 3D HPE performance without requiring camera parameters. Code and models are available at https://github.com/z0911k/SGraFormer.
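
The abstract leaves the encoder's internals to the paper; purely as an illustration, the sketch below shows one way a transformer layer can inject skeletal-edge knowledge into joint-to-joint attention. It is a minimal PyTorch sketch under our own assumptions (hypothetical class names, a placeholder 17-joint adjacency), not the authors' SGraFormer implementation, which additionally embeds joint position and spatial-structure cues and fuses features across views; see the linked repository for the real code.

```python
# Minimal sketch only, NOT the authors' SGraFormer code. Class names, shapes,
# and the adjacency-bias mechanism are assumptions made for exposition.
import torch
import torch.nn as nn


class SemanticGraphAttention(nn.Module):
    """Joint-to-joint self-attention biased by skeletal adjacency."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (batch, joints, dim); adj: (joints, joints) with 1 where two
        # joints share a bone (self-loops included), 0 elsewhere.
        # The adjacency becomes an additive attention bias: non-adjacent
        # pairs receive a large negative score, so each joint attends
        # mainly to its skeletal neighbours.
        bias = (adj == 0).float() * -1e4
        out, _ = self.attn(x, x, x, attn_mask=bias)
        return out


class SemanticGraphTransformerLayer(nn.Module):
    """One pre-norm encoder layer: graph-biased attention + feed-forward."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = SemanticGraphAttention(dim, num_heads)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.norm1(x), adj)
        x = x + self.mlp(self.norm2(x))
        return x


if __name__ == "__main__":
    B, J, D = 2, 17, 64           # batch, joints, feature channels
    x = torch.randn(B, J, D)      # per-joint features from one camera view
    adj = torch.eye(J)            # placeholder adjacency; a real one encodes bones
    layer = SemanticGraphTransformerLayer(D)
    print(layer(x, adj).shape)    # torch.Size([2, 17, 64])
```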

Published

2024-03-24

How to Cite

Zhang, L., Zhou, K., Lu, F., Zhou, X.-D., & Shi, Y. (2024). Deep Semantic Graph Transformer for Multi-View 3D Human Pose Estimation. Proceedings of the AAAI Conference on Artificial Intelligence, 38(7), 7205-7214. https://doi.org/10.1609/aaai.v38i7.28549

Issue

Vol. 38 No. 7 (2024)

Section

AAAI Technical Track on Computer Vision VI