VTD-CLIP: Video-to-Text Discretization via Prompting CLIP
DOI:
https://doi.org/10.1609/aaai.v40i16.38408
Abstract
Vision-language models bridge visual and linguistic understanding and have proven powerful for video recognition tasks. Existing methods primarily rely on parameter-efficient fine-tuning of pre-trained image-text models, and they suffer from limited interpretability and poor generalization due to inadequate temporal modeling. To address these limitations, we propose a simple yet effective video-to-text discretization framework. Our approach leverages the frozen text encoder to build a visual codebook derived from video class labels, exploiting the many-to-one contrastive alignment between visual and textual embeddings in multimodal pretraining. This codebook enables the transformation of temporal visual features into discrete textual tokens via feature lookups, yielding interpretable video representations through explicit video modeling. To improve robustness against noisy or irrelevant frames, we then introduce a confidence-aware fusion module that dynamically weights keyframes based on their semantic relevance, as measured by the codebook. Furthermore, we incorporate learnable text prompts to adaptively update the codebook during training. Experiments on four datasets (HMDB-51, UCF-101, Something-Something-v2, and Kinetics-400) show that our approach achieves competitive improvements over state-of-the-art methods.
Published
2026-03-14
How to Cite
Zhu, W., Wang, Y., Li, H., & Zhu, P. (2026). VTD-CLIP: Video-to-Text Discretization via Prompting CLIP. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13979–13987. https://doi.org/10.1609/aaai.v40i16.38408
Issue
Vol. 40 No. 16
Section
AAAI Technical Track on Computer Vision XIII
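
The codebook lookup and confidence-aware fusion described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `discretize_and_fuse`, the `temperature` parameter, the use of the maximum softmax probability as a frame's confidence, and the weight normalization are all assumptions made for illustration. Random arrays stand in for the frozen text-encoder embeddings of the class labels (the codebook) and for per-frame visual features. Learnable text prompts are not modeled.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def discretize_and_fuse(frame_feats, codebook, temperature=0.07):
    """Hypothetical sketch of lookup-based discretization plus weighted fusion.

    frame_feats: (T, D) per-frame visual features.
    codebook:    (C, D) text embeddings, one per class label.
    Returns per-frame token ids, per-frame weights, and a fused video feature.
    """
    frames = l2_normalize(frame_feats)
    codes = l2_normalize(codebook)

    # Cosine similarity between every frame and every codebook entry.
    sims = frames @ codes.T                      # (T, C)

    # Feature lookup: each frame is replaced by its nearest text embedding.
    token_ids = sims.argmax(axis=1)              # (T,)
    quantized = codes[token_ids]                 # (T, D)

    # Assumed confidence measure: peak softmax probability per frame.
    confidence = softmax(sims / temperature, axis=1).max(axis=1)  # (T,)
    weights = confidence / confidence.sum()      # weights sum to 1

    # Confidence-weighted average of the discrete tokens.
    video_feat = l2_normalize((weights[:, None] * quantized).sum(axis=0))
    return token_ids, weights, video_feat


# Usage with random stand-ins for real encoder outputs.
rng = np.random.default_rng(0)
demo_codebook = rng.normal(size=(5, 16))   # 5 class labels, 16-dim embeddings
demo_frames = rng.normal(size=(8, 16))     # 8 frames
demo_ids, demo_weights, demo_video = discretize_and_fuse(demo_frames, demo_codebook)
predicted_class = int((l2_normalize(demo_codebook) @ demo_video).argmax())
```

In this sketch, `demo_ids` gives one class-label index per frame, which is what makes the representation readable as text, and frames whose similarity is spread across several codebook entries receive smaller weights in the fused feature.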