Wang, Hao, Fang Liu, Licheng Jiao, Jiahao Wang, Zehua Hao, Shuo Li, Lingling Li, Puhua Chen, and Xu Liu. 2024. “ViLT-CLIP: Video and Language Tuning CLIP With Multimodal Prompt Learning and Scenario-Guided Optimization.” Proceedings of the AAAI Conference on Artificial Intelligence 38 (6): 5390-5400. https://doi.org/10.1609/aaai.v38i6.28347.