Li, Y., Liu, H. and Tang, H. (2022) “Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking”, Proceedings of the AAAI Conference on Artificial Intelligence, 36(2), pp. 1456-1463. doi: 10.1609/aaai.v36i2.20035.