Zhu, Yongxin, et al. “Locate Then Generate: Bridging Vision and Language With Bounding Box for Scene-Text VQA.” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 9, June 2023, pp. 11479-87, doi:10.1609/aaai.v37i9.26357.