Zhong, Weihong, et al. “STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-Training.” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 3, June 2023, pp. 3715-23, doi:10.1609/aaai.v37i3.25483.