VTD-CLIP: Video-to-Text Discretization via Prompting CLIP

Authors

  • Wencheng Zhu, School of Artificial Intelligence, Tianjin University; Haihe Laboratory of Information Technology Application Innovation
  • Yuexin Wang, School of Artificial Intelligence, Tianjin University
  • Hongxuan Li, School of Artificial Intelligence, Tianjin University
  • Pengfei Zhu, School of Artificial Intelligence, Tianjin University; Low-Altitude Intelligence Laboratory, Xiong'an National Innovation Center; Xiong'an Guochuang Lantian Technology Co., Ltd.

DOI:

https://doi.org/10.1609/aaai.v40i16.38408

Abstract

Vision-language models bridge visual and linguistic understanding and have proven powerful for video recognition tasks. Existing methods rely primarily on parameter-efficient fine-tuning of pre-trained image-text models and suffer from limited interpretability and poor generalization due to inadequate temporal modeling. To address these limitations, we propose a simple yet effective video-to-text discretization framework. Our approach uses the frozen text encoder to build a visual codebook from video class labels, exploiting the many-to-one contrastive alignment between visual and textual embeddings in multimodal pretraining. The codebook transforms temporal visual features into discrete textual tokens via feature lookups, yielding interpretable video representations through explicit video modeling. To improve robustness against noisy or irrelevant frames, we introduce a confidence-aware fusion module that dynamically weights keyframes by their semantic relevance, as measured by the codebook. Furthermore, we incorporate learnable text prompts that adaptively update the codebook during training. Experiments on four datasets (HMDB-51, UCF-101, Something-Something-v2, and Kinetics-400) validate our approach, which achieves competitive improvements over state-of-the-art approaches.
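
The pipeline described in the abstract (codebook lookup followed by confidence-weighted fusion) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the hard arg-max lookup, the softmax weighting, and the `temperature` value are all assumptions made for illustration. The codebook here is a plain array standing in for frozen text-encoder embeddings of class labels, and the learnable prompts are not modeled.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def discretize_and_fuse(frame_feats, codebook, temperature=0.07):
    """Hypothetical sketch of codebook lookup and confidence-aware fusion.

    frame_feats: (T, D) per-frame visual features.
    codebook:    (C, D) text embeddings, one per class label (assumed given).
    Returns per-frame token ids (T,), frame weights (T,), class logits (C,).
    """
    f = l2_normalize(frame_feats)
    c = l2_normalize(codebook)

    # Cosine similarity between every frame and every codebook entry.
    sims = f @ c.T                                # (T, C)

    # Feature lookup: each frame maps to its nearest textual token.
    token_ids = sims.argmax(axis=1)               # (T,)

    # Assumed confidence measure: similarity to the selected entry.
    confidence = sims.max(axis=1)                 # (T,)
    weights = softmax(confidence / temperature)   # (T,), sums to 1

    # Replace each frame by its codebook entry, then fuse with the weights.
    quantized = c[token_ids]                      # (T, D)
    video_feat = l2_normalize(weights @ quantized)  # (D,)

    # Score the fused video representation against every class embedding.
    logits = video_feat @ c.T                     # (C,)
    return token_ids, weights, logits


if __name__ == "__main__":
    # Toy data: 3 classes in a 4-dimensional space, 3 frames.
    toy_codebook = np.eye(3, 4)
    toy_frames = np.array([
        [0.1, 1.0, 0.0, 0.0],   # clearly class 1
        [0.0, 0.9, 0.1, 0.0],   # clearly class 1
        [0.6, 0.5, 0.5, 0.4],   # ambiguous frame, low confidence
    ])
    ids, w, scores = discretize_and_fuse(toy_frames, toy_codebook)
    print("token ids:", ids)
    print("frame weights:", w)
    print("predicted class:", int(scores.argmax()))
```

In this toy setup the ambiguous third frame receives a much smaller fusion weight than the two frames that match a codebook entry closely, which is the behavior the confidence-aware fusion is meant to provide. The per-frame token ids are what make the representation inspectable: each frame can be read as the class label it was mapped to.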

Published

2026-03-14

How to Cite

Zhu, W., Wang, Y., Li, H., & Zhu, P. (2026). VTD-CLIP: Video-to-Text Discretization via Prompting CLIP. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13979–13987. https://doi.org/10.1609/aaai.v40i16.38408

Issue

Section

AAAI Technical Track on Computer Vision XIII