[1] M. F. Ilaslan, A. Köksal, K. Q. Lin, B. Satar, M. Z. Shou, and Q. Xu, “VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting,” AAAI, vol. 39, no. 4, pp. 3886–3894, Apr. 2025.