ViLT-CLIP: Video and Language Tuning CLIP with Multimodal Prompt Learning and Scenario-Guided Optimization

Authors

  • Hao Wang, Xidian University
  • Fang Liu, Xidian University
  • Licheng Jiao, Xidian University
  • Jiahao Wang, Xidian University
  • Zehua Hao, Xidian University
  • Shuo Li, Xidian University
  • Lingling Li, Xidian University
  • Puhua Chen, Xidian University
  • Xu Liu, Xidian University

DOI:

https://doi.org/10.1609/aaai.v38i6.28347

Keywords:

CV: Language and Vision, CV: Image and Video Retrieval, CV: Multi-modal Vision, CV: Video Understanding & Activity Analysis

Abstract

Pre-trained vision-language (V-L) models such as CLIP have demonstrated impressive zero-shot performance on many downstream tasks. Because contrastive pre-training on video-text pairs, in the manner of CLIP, is limited by its high cost and scale, recent approaches focus on efficiently transferring the image-based CLIP to the video domain. A major finding is that fine-tuning the pre-trained model to achieve strong fully supervised performance leads to poor zero-shot, few-shot, and base-to-novel generalization. Conversely, freezing the backbone network to maintain generalization ability weakens fully supervised performance. Moreover, no single prompt-tuning branch consistently performs best. In this work, we propose a multimodal prompt learning scheme that balances supervised and generalized performance. Our prompting approach contains three components: 1) independent prompts on both the vision and text branches to learn the language and visual contexts; 2) inter-modal prompt mapping to ensure mutual synergy; 3) reducing the discrepancy between the hand-crafted prompt (a video of a person doing [CLS]) and the learnable prompt, to alleviate forgetting of essential video scenarios. Extensive validation in fully supervised, zero-shot, few-shot, and base-to-novel generalization settings for video recognition indicates that the proposed approach achieves competitive performance at lower computational cost.
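The three components listed in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The abstract does not state the direction of the inter-modal mapping, the discrepancy measure, or how the terms are weighted, so the text-to-vision linear projection, the cosine-distance penalty, the weight `lam`, and every function name below are assumptions made purely for illustration.

```python
import math
import random


def linear_map(prompt, weight):
    """Project prompt tokens (n x d_text) into another space (n x d_vis).

    Assumption: inter-modal prompt mapping is modelled as a single linear
    layer from text prompts to vision prompts; the abstract does not
    specify the actual mapping.
    """
    return [[sum(t * w for t, w in zip(token, col)) for col in zip(*weight)]
            for token in prompt]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def scenario_guided_loss(learned_feats, handcrafted_feats):
    """Mean (1 - cosine similarity) over per-class text features.

    Assumption: the "discrepancy" between features from the learnable
    prompt and from the hand-crafted prompt is measured by cosine distance.
    """
    gaps = [1.0 - cosine(l, h) for l, h in zip(learned_feats, handcrafted_feats)]
    return sum(gaps) / len(gaps)


def total_loss(task_loss, learned_feats, handcrafted_feats, lam=1.0):
    """Task loss plus the weighted discrepancy penalty (lam is hypothetical)."""
    return task_loss + lam * scenario_guided_loss(learned_feats, handcrafted_feats)


if __name__ == "__main__":
    random.seed(0)
    d_text, d_vis, n_tokens = 4, 6, 2

    # 1) Independent learnable prompts: here only the text-side tokens.
    text_prompt = [[random.gauss(0, 0.02) for _ in range(d_text)]
                   for _ in range(n_tokens)]

    # 2) Inter-modal mapping: derive vision-side prompts from text prompts.
    weight = [[random.gauss(0, 0.02) for _ in range(d_vis)]
              for _ in range(d_text)]
    vision_prompt = linear_map(text_prompt, weight)
    print(len(vision_prompt), len(vision_prompt[0]))

    # 3) Penalise drift from the hand-crafted prompt's text features.
    learned = [[1.0, 0.0], [0.0, 1.0]]
    handcrafted = [[1.0, 0.0], [1.0, 0.0]]
    print(total_loss(0.5, learned, handcrafted, lam=2.0))
```

In a real setup, the per-class features would come from a frozen CLIP text encoder fed with the learnable prompt and with the hand-crafted template respectively; here they are plain lists so the arithmetic stays visible.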

Published

2024-03-24

How to Cite

Wang, H., Liu, F., Jiao, L., Wang, J., Hao, Z., Li, S., … Liu, X. (2024). ViLT-CLIP: Video and Language Tuning CLIP with Multimodal Prompt Learning and Scenario-Guided Optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 38(6), 5390–5400. https://doi.org/10.1609/aaai.v38i6.28347

Section

AAAI Technical Track on Computer Vision V