Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA

Authors

  • Yongxin Zhu, University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence
  • Zhen Liu, University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence
  • Yukang Liang, University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence
  • Xin Li, Tencent
  • Hao Liu, Tencent
  • Changcun Bao, Tencent
  • Linli Xu, University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence

DOI:

https://doi.org/10.1609/aaai.v37i9.26357

Keywords:

ML: Multimodal Learning, CV: Applications, CV: Language and Vision, CV: Multi-modal Vision, CV: Object Detection & Categorization, CV: Scene Analysis & Understanding, SNLP: Applications, SNLP: Generation, SNLP: Language Grounding, SNLP: Question Answering

Abstract

In this paper, we propose a novel multi-modal framework for Scene Text Visual Question Answering (STVQA), which requires models to read scene text in images for question answering. Unlike plain text or visual objects, which can exist independently, scene text naturally links the text and visual modalities: it conveys linguistic semantics while simultaneously being a visual object in the image. In contrast to conventional STVQA models, which treat the linguistic semantics and visual semantics of scene text as two separate features, we propose a "Locate Then Generate" (LTG) paradigm that explicitly unifies these two semantics with the spatial bounding box as the bridge connecting them. Specifically, LTG first locates the region in an image that may contain the answer words with an answer location module (ALM), consisting of a region proposal network and a language refinement network, which can be transformed into each other through a one-to-one mapping via the scene text bounding box. Then, given the answer words selected by the ALM, LTG generates a readable answer sequence with an answer generation module (AGM) based on a pre-trained language model. Benefiting from the explicit alignment of visual and linguistic semantics, even without any scene-text-based pre-training tasks, LTG improves absolute accuracy by +6.06% on the TextVQA dataset and +6.92% on the ST-VQA dataset compared with a non-pre-training baseline. We further demonstrate that LTG effectively unifies the visual and text modalities through the spatial bounding box connection, which has been underappreciated in previous methods.
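To make the two-stage pipeline in the abstract more concrete, below is a minimal, illustrative PyTorch sketch, not the authors' implementation. All module names, feature dimensions, and the scoring and ordering schemes are assumptions for illustration: the ALM stand-in scores each OCR token's bounding-box region as a candidate answer word by fusing visual, linguistic, and box features, and the AGM stand-in simply orders the selected words left to right by box position, whereas the paper builds the AGM on a pre-trained language model.

# Minimal sketch (not the authors' code) of a Locate-Then-Generate pipeline.
# Names, dimensions, and scoring are illustrative assumptions.
import torch
import torch.nn as nn


class AnswerLocationModule(nn.Module):
    """Stand-in for the ALM: scores each OCR token's region as a candidate answer word."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Fuse visual region features, token embeddings, and box coordinates.
        self.fuse = nn.Sequential(
            nn.Linear(feat_dim + feat_dim + 4, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, 1),
        )

    def forward(self, region_feats, token_embeds, boxes):
        # region_feats: (N, feat_dim) visual features of OCR regions
        # token_embeds: (N, feat_dim) linguistic embeddings of OCR tokens
        # boxes:        (N, 4) normalized bounding boxes linking the two
        fused = torch.cat([region_feats, token_embeds, boxes], dim=-1)
        return self.fuse(fused).squeeze(-1)  # (N,) answer-word scores


class AnswerGenerationModule:
    """Toy stand-in for the AGM: orders located words into a readable answer.

    The paper uses a pre-trained language model; here we only sort the
    selected tokens left-to-right by box position for illustration.
    """

    def generate(self, tokens, boxes, selected):
        picked = [(boxes[i, 0].item(), tokens[i]) for i in selected]
        return " ".join(tok for _, tok in sorted(picked))


if __name__ == "__main__":
    torch.manual_seed(0)
    tokens = ["coffee", "shop", "open", "24", "hours"]
    n, d = len(tokens), 256
    region_feats, token_embeds = torch.randn(n, d), torch.randn(n, d)
    boxes = torch.rand(n, 4)

    alm = AnswerLocationModule(d)
    scores = alm(region_feats, token_embeds, boxes)
    selected = (torch.sigmoid(scores) > 0.5).nonzero(as_tuple=True)[0].tolist()

    agm = AnswerGenerationModule()
    print("located indices:", selected)
    print("answer:", agm.generate(tokens, boxes, selected) or "<none located>")

Note that the bounding box enters the fusion explicitly, mirroring the abstract's idea of using the box as the bridge between visual region features and token-level linguistic features; everything else is a simplified placeholder.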

Published

2023-06-26

How to Cite

Zhu, Y., Liu, Z., Liang, Y., Li, X., Liu, H., Bao, C., & Xu, L. (2023). Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA. Proceedings of the AAAI Conference on Artificial Intelligence, 37(9), 11479-11487. https://doi.org/10.1609/aaai.v37i9.26357

Section

AAAI Technical Track on Machine Learning IV