Li, Yidi, et al. “Multi-Modal Perception Attention Network With Self-Supervised Learning for Audio-Visual Speaker Tracking.” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, June 2022, pp. 1456-63, doi:10.1609/aaai.v36i2.20035.