Li, Y., Liu, H., & Tang, H. (2022). Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking. Proceedings of the AAAI Conference on Artificial Intelligence, 36(2), 1456-1463. https://doi.org/10.1609/aaai.v36i2.20035