[1] Zhong, W. et al. 2023. STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training. Proceedings of the AAAI Conference on Artificial Intelligence. 37, 3 (Jun. 2023), 3715–3723. DOI:https://doi.org/10.1609/aaai.v37i3.25483.