CLIP-Gaze: Towards General Gaze Estimation via Visual-Linguistic Model

Authors

  • Pengwei Yin Hikvision Research Institute
  • Guanzhong Zeng Hikvision Research Institute
  • Jingjing Wang Hikvision Research Institute
  • Di Xie Hikvision Research Institute

DOI:

https://doi.org/10.1609/aaai.v38i7.28496

Keywords:

CV: Biometrics, Face, Gesture & Pose, CV: Language and Vision

Abstract

Gaze estimation methods often experience significant performance degradation when evaluated across different domains, due to the domain gap between the testing and training data. Existing methods try to address this issue using various domain generalization approaches, but with little success because gaze datasets offer limited diversity in factors such as appearance, wearables, and image quality. To overcome these limitations, we propose a novel framework called CLIP-Gaze that utilizes a pre-trained vision-language model to leverage its transferable knowledge. Our framework is the first to apply a vision-and-language cross-modality approach to the gaze estimation task. Specifically, we extract gaze-relevant features by pushing them away from gaze-irrelevant features, which can be flexibly constructed via language descriptions. To learn more suitable prompts, we propose a personalized context optimization method for text prompt tuning. Furthermore, we utilize the relationship among gaze samples to refine the distribution of gaze-relevant features, thereby improving the generalization capability of the gaze estimation model. Extensive experiments demonstrate the excellent performance of CLIP-Gaze over existing methods on four cross-domain evaluations.
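
The idea of pushing a gaze-relevant feature away from gaze-irrelevant features can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `l2_normalize` and `irrelevance_loss` are hypothetical, the loss form (mean absolute cosine similarity) is an assumption chosen for illustration, and the gaze-irrelevant features are stand-in vectors where a real system would presumably use text-encoder embeddings of prompts describing factors such as appearance or image quality.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def irrelevance_loss(gaze_feats, irrelevant_feats):
    """Hypothetical loss: mean absolute cosine similarity between
    gaze features (B, D) and gaze-irrelevant features (K, D).
    Minimizing it drives gaze features toward orthogonality with
    every gaze-irrelevant direction."""
    g = l2_normalize(gaze_feats)        # (B, D)
    t = l2_normalize(irrelevant_feats)  # (K, D)
    cos = g @ t.T                       # (B, K) pairwise cosine similarities
    return np.abs(cos).mean()

# Stand-in data (assumption): random vectors in place of real
# image features and text embeddings.
rng = np.random.default_rng(0)
gaze_feats = rng.normal(size=(4, 16))
irrelevant_feats = rng.normal(size=(3, 16))
loss = irrelevance_loss(gaze_feats, irrelevant_feats)
```

Under this assumed formulation the loss lies between 0 and 1: it approaches 0 when the gaze features are orthogonal to all gaze-irrelevant features and approaches 1 when they are aligned with them.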

Published

2024-03-24

How to Cite

Yin, P., Zeng, G., Wang, J., & Xie, D. (2024). CLIP-Gaze: Towards General Gaze Estimation via Visual-Linguistic Model. Proceedings of the AAAI Conference on Artificial Intelligence, 38(7), 6729-6737. https://doi.org/10.1609/aaai.v38i7.28496

Issue

Vol. 38 No. 7 (2024)

Section

AAAI Technical Track on Computer Vision VI