Dynamic Key-Value Memory Enhanced Multi-Step Graph Reasoning for Knowledge-Based Visual Question Answering

Authors

  • Mingxiao Li KU Leuven
  • Marie-Francine Moens KU Leuven

DOI:

https://doi.org/10.1609/aaai.v36i10.21346

Keywords:

Speech & Natural Language Processing (SNLP), Computer Vision (CV), Knowledge Representation And Reasoning (KRR), Machine Learning (ML)

Abstract

Knowledge-based visual question answering (VQA) is a vision-language task that requires an agent to correctly answer image-related questions using knowledge that is not present in the given image. It is not only a more challenging task than regular VQA but also a vital step towards building a general VQA system. Most existing knowledge-based VQA systems process knowledge and image information in the same way and ignore the fact that the knowledge base (KB) contains complete information about a triplet, while the extracted image information might be incomplete because the relations between two objects are missing or wrongly detected. In this paper, we propose a novel model named dynamic knowledge memory enhanced multi-step graph reasoning (DMMGR), which performs explicit and implicit reasoning over a key-value knowledge memory module and a spatial-aware image graph, respectively. Specifically, the memory module learns a dynamic knowledge representation and generates a knowledge-aware question representation at each reasoning step. Then, this representation is used to guide a graph attention operator over the spatial-aware image graph. Our model achieves new state-of-the-art accuracy on the KRVQR and FVQA datasets. We also conduct ablation experiments to demonstrate the effectiveness of each component of the proposed model.
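
The two-stage step described in the abstract (read from a key-value memory, then use the resulting question representation to attend over image-graph nodes) can be illustrated with a generic sketch. This is not the authors' implementation: the dot-product scoring, the additive updates, and all function names and toy vectors below are assumptions made for illustration, and the sketch ignores graph edges and spatial features entirely.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Sum of vectors, each scaled by its weight."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def memory_read(query, keys, values):
    """Score the query against memory keys, return the attended value vector."""
    weights = softmax([dot(query, k) for k in keys])
    return weighted_sum(weights, values), weights


def reasoning_step(query, keys, values, nodes):
    """One hypothetical step: memory read, then query-guided node attention."""
    # Assumption: the knowledge-aware query is the query plus the memory read.
    read, _ = memory_read(query, keys, values)
    knowledge_query = [q + r for q, r in zip(query, read)]
    # Assumption: node attention is plain dot-product attention over all nodes.
    node_weights = softmax([dot(knowledge_query, n) for n in nodes])
    context = weighted_sum(node_weights, nodes)
    # Assumption: the next query adds the attended visual context.
    next_query = [k + c for k, c in zip(knowledge_query, context)]
    return next_query, node_weights


def multi_step(query, keys, values, nodes, steps=2):
    """Repeat the step several times, returning the final query and node weights."""
    node_weights = []
    for _ in range(steps):
        query, node_weights = reasoning_step(query, keys, values, nodes)
    return query, node_weights


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0]]      # toy memory keys
    values = [[0.5, 0.5], [0.0, 1.0]]    # toy memory values
    nodes = [[1.0, 0.0], [0.0, 1.0]]     # toy image-node features
    final_query, weights = multi_step([1.0, 0.0], keys, values, nodes)
    print(final_query, weights)
```

In this toy setup the memory keys, memory values, and node features share one dimension so that plain dot products suffice; a learned model would typically use projection matrices instead.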

Published

2022-06-28

How to Cite

Li, M., & Moens, M.-F. (2022). Dynamic Key-Value Memory Enhanced Multi-Step Graph Reasoning for Knowledge-Based Visual Question Answering. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10), 10983-10992. https://doi.org/10.1609/aaai.v36i10.21346

Section

AAAI Technical Track on Speech and Natural Language Processing