Re-Attention for Visual Question Answering

Wenya Guo; Ying Zhang; Xiaoping Wu; Jufeng Yang; Xiangrui Cai; Xiaojie Yuan

doi:10.1609/aaai.v34i01.5338

Authors

Wenya Guo, Nankai University
Ying Zhang, Nankai University
Xiaoping Wu, Nankai University
Jufeng Yang, Nankai University
Xiangrui Cai, Nankai University
Xiaojie Yuan, Nankai University

DOI:

https://doi.org/10.1609/aaai.v34i01.5338

Abstract

Visual Question Answering (VQA) requires a simultaneous understanding of images and questions. Existing methods achieve good performance by focusing on both key objects in images and key words in questions. However, the answer also contains rich information that can help to better describe the image and generate more accurate attention maps. In this paper, to utilize the information in the answer, we propose a re-attention framework for the VQA task. We first associate the image and the question by calculating the similarity of each object-word pair in the feature space. Then, based on the answer, the learned model re-attends to the corresponding visual objects in the image and reconstructs the initial attention map to produce consistent results. Benefiting from the re-attention procedure, the question can be better understood and a satisfactory answer is generated. Extensive experiments on the benchmark dataset demonstrate that the proposed method performs favorably against state-of-the-art approaches.
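The object-word association step described in the abstract can be illustrated with a small sketch. The abstract does not specify the similarity function or how pair scores are turned into an attention map, so everything below is an assumption made for illustration: cosine similarity as the pairwise measure, a max over words to score each object, and a softmax to normalize the scores. The function names (`cosine`, `softmax`, `object_attention`) and the toy feature vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def object_attention(object_feats, word_feats):
    """Return the object-word similarity matrix and an attention map over objects.

    Assumption (not from the paper): an object's score is its similarity to the
    best-matching word, and the scores are normalized with a softmax.
    """
    sim = [[cosine(o, w) for w in word_feats] for o in object_feats]
    scores = [max(row) for row in sim]
    return sim, softmax(scores)


# Toy 2-d features: three detected objects and two question words (hypothetical values).
objects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[1.0, 0.0], [0.9, 0.1]]

sim, attention = object_attention(objects, words)
```

In this toy setup the first object lies closest to both word vectors, so it receives the largest attention weight. The re-attention step, which uses the answer to re-attend to objects and reconstruct this initial map, is not sketched here because the abstract gives no detail on how it is computed.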
