Zhu, Y. (2023) “Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA”, Proceedings of the AAAI Conference on Artificial Intelligence, 37(9), pp. 11479–11487. doi: 10.1609/aaai.v37i9.26357.