ViLT-CLIP: Video and Language Tuning CLIP with Multimodal Prompt Learning and Scenario-Guided Optimization

Hao Wang; Fang Liu; Licheng Jiao; Jiahao Wang; Zehua Hao; Shuo Li; Lingling Li; Puhua Chen; Xu Liu

doi:10.1609/aaai.v38i6.28347

Authors

Hao Wang, Xidian University
Fang Liu, Xidian University
Licheng Jiao, Xidian University
Jiahao Wang, Xidian University
Zehua Hao, Xidian University
Shuo Li, Xidian University
Lingling Li, Xidian University
Puhua Chen, Xidian University
Xu Liu, Xidian University

DOI:

https://doi.org/10.1609/aaai.v38i6.28347

Keywords:

CV: Language and Vision, CV: Image and Video Retrieval, CV: Multi-modal Vision, CV: Video Understanding & Activity Analysis

Abstract

Pre-trained vision-language (V-L) models such as CLIP have demonstrated impressive zero-shot performance on many downstream tasks. Since applying CLIP-style contrastive pre-training on video-text pairs to video tasks is limited by its high cost and scale, recent approaches focus on efficiently transferring the image-based CLIP to the video domain. A major finding is that fine-tuning the pre-trained model to achieve strong fully supervised performance leads to poor zero-shot, few-shot, and base-to-novel generalization. Conversely, freezing the backbone network to maintain generalization ability weakens fully supervised performance. Moreover, no single prompt-tuning branch consistently performs best. In this work, we propose a multimodal prompt learning scheme that balances supervised and generalized performance. Our prompting approach contains three components: 1) independent prompts on both the vision and text branches to learn the language and visual contexts; 2) inter-modal prompt mapping to ensure mutual synergy; 3) reducing the discrepancy between the hand-crafted prompt (a video of a person doing [CLS]) and the learnable prompt, to alleviate forgetting of essential video scenarios. Extensive validation in fully supervised, zero-shot, few-shot, and base-to-novel generalization settings for video recognition indicates that the proposed approach achieves competitive performance at lower compute cost.
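
The three components named in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the abstract does not specify the direction of the inter-modal mapping, the form of the discrepancy term, or any dimensions, so the text-to-vision linear projection, the cosine-based loss, and all function names below (`build_prompts`, `discrepancy_loss`, `linear`, `cosine`) are assumptions chosen only to make the idea concrete.

```python
import math


def linear(x, W):
    """Project vector x (length d_in) with matrix W (d_out rows, d_in columns)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_prompts(text_prompts, vision_prompts, W_map):
    """Components 1 and 2: independent prompts per branch plus a mapping.

    text_prompts:   list of learnable text-prompt vectors (each length d_text)
    vision_prompts: list of learnable vision-prompt vectors (each length d_vis)
    W_map:          hypothetical d_vis x d_text projection (assumed direction:
                    text -> vision) that couples the two branches
    """
    mapped = [linear(p, W_map) for p in text_prompts]
    # Assumption: the mapped text prompt is added to the independent vision prompt.
    coupled = [
        [v + m for v, m in zip(vp, mp)]
        for vp, mp in zip(vision_prompts, mapped)
    ]
    return text_prompts, coupled


def discrepancy_loss(learned_feat, handcrafted_feat):
    """Component 3: penalize drift of the learnable prompt's text feature
    from the feature of the hand-crafted prompt
    ("a video of a person doing [CLS]").

    Assumption: 1 - cosine similarity; the abstract does not state the form.
    """
    return 1.0 - cosine(learned_feat, handcrafted_feat)
```

In a training loop, a term like `discrepancy_loss` would presumably be added to the usual video-text classification loss with some weighting, while the CLIP backbone stays frozen and only the prompt vectors and `W_map` are updated; the weighting and the exact feature being compared are left open here because the abstract does not give them.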
