CLIP2Pose: Frozen CLIP as Semantic Guide for Domain Adaptive Pose Estimation
DOI:
https://doi.org/10.1609/aaai.v40i8.37546
Abstract
Unsupervised domain adaptive pose estimation is a fundamental yet challenging task due to the need to transfer from labeled synthetic data to unlabeled real data. Nevertheless, the underlying pose semantics, which are governed by spatial structure, remain largely consistent across domains. This observation motivates the use of vision-language models, which provide domain-invariant representations that align well with high-level semantic concepts. Building on this insight, we propose CLIP2Pose, a novel framework that leverages the semantic robustness of frozen CLIP encoders to facilitate cross-domain generalization. We first introduce a semantic-driven prompt mechanism that encodes structural priors, domain-specific appearance, and instance-level context into the image representation. This guides the model to focus on semantically meaningful and structurally relevant features. Next, we propose a semantic modulation module that adaptively refines visual features by conditioning them on prompt-derived embeddings, enhancing alignment between semantics and visual patterns. To further bridge the modality and domain gaps, we design a directional alignment loss that encourages consistent structural reasoning across both vision and language representations. Extensive experiments on domain adaptive human body and hand pose benchmarks show that CLIP2Pose achieves state-of-the-art performance.
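The abstract gives no formulas for the directional alignment loss. A minimal sketch of one plausible reading, assuming a CLIP-style directional objective in which the shift between source- and target-domain image embeddings is encouraged to match the shift between the corresponding prompt embeddings; all function and variable names here are hypothetical, not from the paper:

```python
import numpy as np

def directional_alignment_loss(img_src, img_tgt, txt_src, txt_tgt, eps=1e-8):
    """Hypothetical directional alignment loss (sketch, not the paper's definition).

    Penalizes misalignment between the visual domain-shift direction
    (img_tgt - img_src) and the language domain-shift direction
    (txt_tgt - txt_src) using cosine similarity.
    """
    d_img = img_tgt - img_src  # direction of the domain shift in image space
    d_txt = txt_tgt - txt_src  # direction of the domain shift in text space
    cos = np.dot(d_img, d_txt) / (
        np.linalg.norm(d_img) * np.linalg.norm(d_txt) + eps
    )
    return 1.0 - cos  # zero when the two directions coincide

# Toy check: identical shift directions yield a near-zero loss.
img_a, img_b = np.array([1.0, 0.0]), np.array([2.0, 0.0])
txt_a, txt_b = np.array([0.0, 1.0]), np.array([1.0, 1.0])
print(round(directional_alignment_loss(img_a, img_b, txt_a, txt_b), 4))  # → 0.0
```

Under this reading, a frozen CLIP image encoder would supply `img_src`/`img_tgt` (synthetic vs. real crops) and the text encoder would supply `txt_src`/`txt_tgt` from the domain-specific prompts.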
Published
2026-03-14
How to Cite
Li, J., Jiang, F., Zhu, D., Shi, J., & Zhou, A. (2026). CLIP2Pose: Frozen CLIP as Semantic Guide for Domain Adaptive Pose Estimation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(8), 6208–6216. https://doi.org/10.1609/aaai.v40i8.37546
Section
AAAI Technical Track on Computer Vision V