Zhong W, Zheng M, Tang D, Luo X, Gong H, Feng X, et al. STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training. AAAI [Internet]. 2023 Jun. 26 [cited 2026 May 13];37(3):3715-23. Available from: https://ojs.aaai.org/index.php/AAAI/article/view/25483