[1] W. Zhong, “STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training,” AAAI, vol. 37, no. 3, pp. 3715–3723, Jun. 2023.