Wang, H. (2024) “ViLT-CLIP: Video and Language Tuning CLIP with Multimodal Prompt Learning and Scenario-Guided Optimization”, Proceedings of the AAAI Conference on Artificial Intelligence, 38(6), pp. 5390–5400. doi: 10.1609/aaai.v38i6.28347.