CapeNext: Rethinking and Refining Dynamic Support Information for Category-Agnostic Pose Estimation
DOI:
https://doi.org/10.1609/aaai.v40i16.38410
Abstract
Recent research in Category-Agnostic Pose Estimation (CAPE) has adopted fixed textual keypoint descriptions as semantic priors for two-stage pose matching frameworks. While this paradigm enhances robustness and flexibility by removing the dependency on support images, our critical analysis reveals two inherent limitations of static joint embeddings: (1) polysemy-induced cross-category ambiguity during the matching process (e.g., the concept "leg" exhibiting divergent visual manifestations across humans and furniture), and (2) insufficient discriminability for fine-grained intra-category variations (e.g., posture and fur discrepancies between a sleeping white cat and a standing black cat). To overcome these challenges, we propose a new framework that integrates hierarchical cross-modal interaction with dual-stream feature refinement, enhancing the joint embedding with both class-level and instance-specific cues from textual descriptions and specific images. Experiments on the MP-100 dataset demonstrate that, regardless of the network backbone, CapeNext consistently outperforms state-of-the-art CAPE methods by a large margin.
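The core idea of the abstract, enriching a static textual keypoint embedding with class-level and instance-specific cues, can be sketched as follows. This is a minimal illustrative toy, not the paper's actual architecture: the function name, the attention-pooled instance cue, and the additive gated fusion are all assumptions standing in for the paper's hierarchical cross-modal interaction and dual-stream refinement modules.

```python
import numpy as np

def refine_keypoint_embedding(text_emb, class_emb, inst_feat,
                              w_class=0.5, w_inst=0.5):
    """Toy refinement of a static text keypoint embedding.

    text_emb:  (D,) static textual keypoint embedding (e.g., "leg")
    class_emb: (D,) class-level cue (e.g., "cat" vs. "chair")
    inst_feat: (N, D) instance-specific image token features
    (All names/shapes are illustrative assumptions, not the paper's API.)
    """
    # Instance cue: attention-pool image tokens with the text embedding as query.
    scores = inst_feat @ text_emb            # (N,) similarity per image token
    attn = np.exp(scores - scores.max())     # numerically stable softmax
    attn /= attn.sum()
    inst_cue = attn @ inst_feat              # (D,) attention-pooled instance cue

    # Fuse the static embedding with class-level and instance-level cues.
    refined = text_emb + w_class * class_emb + w_inst * inst_cue
    return refined / np.linalg.norm(refined)  # keep on the unit sphere

# Example: refine one "leg" keypoint embedding for a specific image.
rng = np.random.default_rng(0)
D = 8
text_emb = rng.normal(size=D)
class_emb = rng.normal(size=D)               # class-level cue
inst_feat = rng.normal(size=(16, D))         # 16 image tokens
refined = refine_keypoint_embedding(text_emb, class_emb, inst_feat)
print(refined.shape)                         # (8,)
```

The point of the sketch is only that the matching query becomes a function of both the category and the particular image, rather than a fixed vector shared across all instances.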
Published
2026-03-14
How to Cite
Zhu, Y., Zeng, D., Li, S., Zhao, Q., Shen, Q., & Tang, B. (2026). CapeNext: Rethinking and Refining Dynamic Support Information for Category-Agnostic Pose Estimation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13997–14004. https://doi.org/10.1609/aaai.v40i16.38410
Section
AAAI Technical Track on Computer Vision XIII