[1] Y. Li, H. Liu, and H. Tang, “Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking,” AAAI, vol. 36, no. 2, pp. 1456-1463, Jun. 2022.