LI, Yidi; LIU, Hong; TANG, Hao. Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking. Proceedings of the AAAI Conference on Artificial Intelligence, [S. l.], v. 36, n. 2, p. 1456–1463, 2022. DOI: 10.1609/aaai.v36i2.20035. Available at: https://ojs.aaai.org/index.php/AAAI/article/view/20035. Accessed: 28 May 2026.