Zhong, W., Zheng, M., Tang, D., Luo, X., Gong, H., Feng, X., & Qin, B. (2023). STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training. Proceedings of the AAAI Conference on Artificial Intelligence, 37(3), 3715–3723. https://doi.org/10.1609/aaai.v37i3.25483