Wang, Q., Jiang, P., Guo, Z., Han, Y., & Zhao, Z. (2020). Multi-Speaker Video Dialog with Frame-Level Temporal Localization. Proceedings of the AAAI Conference on Artificial Intelligence, 34(07), 12200-12207. https://doi.org/10.1609/aaai.v34i07.6901