[1] Y. Yu, C. Cao, Y. Zhang, Q. Lv, L. Min, and Y. Zhang, “Building a Multi-modal Spatiotemporal Expert for Zero-shot Action Recognition with CLIP,” AAAI, vol. 39, no. 9, pp. 9689–9697, Apr. 2025.