[1] Y. Tu, L. Li, L. Su, and Q. Huang, “Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning,” AAAI, vol. 39, no. 7, pp. 7464–7472, Apr. 2025.