Zhu, Y., Liu, Z., Liang, Y., Li, X., Liu, H., Bao, C., & Xu, L. (2023). Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA. Proceedings of the AAAI Conference on Artificial Intelligence, 37(9), 11479–11487. https://doi.org/10.1609/aaai.v37i9.26357