(1) Zhong, W.; Zheng, M.; Tang, D.; Luo, X.; Gong, H.; Feng, X.; Qin, B. STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-Training. AAAI 2023, 37, 3715-3723.