[1] X. Fang, W. Fang, C. Wang, X. Qu, and D. Liu, “Rethinking Video-Language Model from the Language Input Perspective”, AAAI, vol. 40, no. 5, pp. 3885–3893, Mar. 2026.