Yu, Y., Cao, C., Zhang, Y., Lv, Q., Min, L., & Zhang, Y. (2025). Building a Multi-modal Spatiotemporal Expert for Zero-shot Action Recognition with CLIP. Proceedings of the AAAI Conference on Artificial Intelligence, 39(9), 9689–9697. https://doi.org/10.1609/aaai.v39i9.33050