Zhong, Weihong, Mao Zheng, Duyu Tang, Xuan Luo, Heng Gong, Xiaocheng Feng, and Bing Qin. 2023. “STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-Training.” Proceedings of the AAAI Conference on Artificial Intelligence 37 (3): 3715–23. https://doi.org/10.1609/aaai.v37i3.25483.