[1] Y. Zhu, “Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA”, AAAI, vol. 37, no. 9, pp. 11479–11487, Jun. 2023.