Zhong, W. (2023) “STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training”, Proceedings of the AAAI Conference on Artificial Intelligence, 37(3), pp. 3715–3723. doi: 10.1609/aaai.v37i3.25483.