CLIP2Pose: Frozen CLIP as Semantic Guide for Domain Adaptive Pose Estimation
DOI:
https://doi.org/10.1609/aaai.v40i8.37546
Abstract
Unsupervised domain adaptive pose estimation is a fundamental yet challenging task due to the need to transfer from labeled synthetic data to unlabeled real data. Nevertheless, the underlying pose semantics, which are governed by spatial structure, remain largely consistent across domains. This observation motivates the use of vision-language models, which provide domain-invariant representations that align well with high-level semantic concepts. Building on this insight, we propose CLIP2Pose, a novel framework that leverages the semantic robustness of frozen CLIP encoders to facilitate cross-domain generalization. We first introduce a semantic-driven prompt mechanism that encodes structural priors, domain-specific appearance, and instance-level context into the image representation. This guides the model to focus on semantically meaningful and structurally relevant features. Next, we propose a semantic modulation module that adaptively refines visual features by conditioning them on prompt-derived embeddings, enhancing alignment between semantics and visual patterns. To further bridge the modality and domain gaps, we design a directional alignment loss that encourages consistent structural reasoning across both vision and language representations. Extensive experiments on domain adaptive human body and hand pose benchmarks show that CLIP2Pose achieves state-of-the-art performance.
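The abstract gives no formulas for the directional alignment loss. A minimal sketch of one plausible reading, assuming a CLIP-style directional objective in which the shift between source- and target-domain image embeddings is encouraged to match the shift between the corresponding prompt embeddings; all function and variable names here are hypothetical, not from the paper:

```python
import numpy as np

def directional_alignment_loss(img_src, img_tgt, txt_src, txt_tgt, eps=1e-8):
    """Hypothetical directional alignment loss (sketch, not the paper's definition).

    Penalizes misalignment between the visual domain-shift direction
    (img_tgt - img_src) and the language domain-shift direction
    (txt_tgt - txt_src) using cosine similarity.
    """
    d_img = img_tgt - img_src  # direction of the domain shift in image space
    d_txt = txt_tgt - txt_src  # direction of the domain shift in text space
    cos = np.dot(d_img, d_txt) / (
        np.linalg.norm(d_img) * np.linalg.norm(d_txt) + eps
    )
    return 1.0 - cos  # zero when the two directions coincide

# Toy check: identical shift directions yield a near-zero loss.
img_a, img_b = np.array([1.0, 0.0]), np.array([2.0, 0.0])
txt_a, txt_b = np.array([0.0, 1.0]), np.array([1.0, 1.0])
print(round(directional_alignment_loss(img_a, img_b, txt_a, txt_b), 4))  # → 0.0
```

Under this reading, a frozen CLIP image encoder would supply `img_src`/`img_tgt` (synthetic vs. real crops) and the text encoder would supply `txt_src`/`txt_tgt` from the domain-specific prompts.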
Published
2026-03-14
How to Cite
Li, J., Jiang, F., Zhu, D., Shi, J., & Zhou, A. (2026). CLIP2Pose: Frozen CLIP as Semantic Guide for Domain Adaptive Pose Estimation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(8), 6208–6216. https://doi.org/10.1609/aaai.v40i8.37546
Section
AAAI Technical Track on Computer Vision V