VTD-CLIP: Video-to-Text Discretization via Prompting CLIP

Wencheng Zhu; Yuexin Wang; Hongxuan Li; Pengfei Zhu

doi:10.1609/aaai.v40i16.38408

Authors

Wencheng Zhu: School of Artificial Intelligence, Tianjin University; Haihe Laboratory of Information Technology Application Innovation
Yuexin Wang: School of Artificial Intelligence, Tianjin University
Hongxuan Li: School of Artificial Intelligence, Tianjin University
Pengfei Zhu: School of Artificial Intelligence, Tianjin University; Low-Altitude Intelligence Laboratory, Xiong'an National Innovation Center; Xiong'an Guochuang Lantian Technology Co., Ltd.

DOI:

https://doi.org/10.1609/aaai.v40i16.38408

Abstract

Vision-language models bridge visual and linguistic understanding and have proven powerful for video recognition. Existing methods rely primarily on parameter-efficient fine-tuning of pre-trained image-text models, and they suffer from limited interpretability and poor generalization due to inadequate temporal modeling. To address these limitations, we propose a simple yet effective video-to-text discretization framework. Our approach uses the frozen text encoder to build a visual codebook derived from video class labels, exploiting the many-to-one contrastive alignment between visual and textual embeddings in multimodal pretraining. The codebook allows temporal visual features to be transformed into discrete textual tokens via feature lookups, yielding interpretable video representations through explicit video modeling. To improve robustness against noisy or irrelevant frames, we introduce a confidence-aware fusion module that dynamically weights keyframes by their semantic relevance, as measured by the codebook. Furthermore, we incorporate learnable text prompts to update the codebook adaptively during training. Experiments on four datasets (HMDB-51, UCF-101, Something-Something-v2, and Kinetics-400) validate the effectiveness of our approach, showing competitive improvements over state-of-the-art methods.
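
The lookup and fusion steps described in the abstract can be illustrated with a small sketch. The abstract does not give the exact formulation, so everything specific here is an assumption: cosine similarity as the lookup metric, the maximum similarity as the per-frame confidence, and a softmax over confidences as the fusion weights. The names `discretize`, `fuse`, and `temperature` are hypothetical, and the toy vectors stand in for frozen CLIP text and frame embeddings.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def discretize(frame_feats, codebook):
    """Map each frame feature to its most similar codebook entry.

    Assumption: the codebook rows are text embeddings of class labels,
    and the best similarity doubles as the frame's confidence.
    """
    indices, confidences = [], []
    for f in frame_feats:
        sims = [cosine(f, c) for c in codebook]
        best = max(range(len(sims)), key=sims.__getitem__)
        indices.append(best)
        confidences.append(sims[best])
    return indices, confidences

def fuse(frame_feats, confidences, temperature=0.1):
    """Confidence-weighted average of frame features (softmax weights)."""
    m = max(confidences)  # subtract the max for numerical stability
    exps = [math.exp((c - m) / temperature) for c in confidences]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frame_feats[0])
    fused = [sum(w * f[d] for w, f in zip(weights, frame_feats))
             for d in range(dim)]
    return fused, weights

# Toy stand-ins: two "class label" embeddings and three frame features.
codebook = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
frames = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.0, 0.2, 0.9]]  # last frame is off-topic

indices, confidences = discretize(frames, codebook)
video_feat, weights = fuse(frames, confidences)
```

In this toy setup the third frame matches neither codebook entry well, so its fusion weight is close to zero and it contributes little to `video_feat`. This is the intended effect of down-weighting noisy or irrelevant frames.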
