Li Y, Liu H, Tang H. Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking. AAAI [Internet]. 2022 Jun. 28 [cited 2026 May 28];36(2):1456-63. Available from: https://ojs.aaai.org/index.php/AAAI/article/view/20035