Li, Zhangbin, Jinxing Zhou, Jing Zhang, Shengeng Tang, Kun Li, and Dan Guo. “Patch-Level Sounding Object Tracking for Audio-Visual Question Answering.” Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 5 (April 11, 2025): 5075–5083. Accessed May 31, 2026. https://ojs.aaai.org/index.php/AAAI/article/view/32538.