Wang, H., Liu, F., Jiao, L., Wang, J., Hao, Z., Li, S., … Liu, X. (2024). ViLT-CLIP: Video and Language Tuning CLIP with Multimodal Prompt Learning and Scenario-Guided Optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 38(6), 5390–5400. https://doi.org/10.1609/aaai.v38i6.28347