Object Attribute Matters in Visual Question Answering

Authors

  • Peize Li, School of Artificial Intelligence, Jilin University, Changchun, China
  • Qingyi Si, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
  • Peng Fu, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
  • Zheng Lin, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
  • Yan Wang, School of Artificial Intelligence, Jilin University, Changchun, China; Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China

DOI:

https://doi.org/10.1609/aaai.v38i17.29816

Keywords:

NLP: Language Grounding & Multi-modal NLP, CV: Language and Vision, NLP: Applications, NLP: Question Answering

Abstract

Visual question answering is a multimodal task that requires the joint comprehension of visual and textual information. However, integrating visual and textual semantics solely through attention layers is insufficient to comprehensively understand and align information from both modalities. Intuitively, object attributes can naturally serve as a bridge to unify the two modalities, yet they have been overlooked in previous research. In this paper, we propose a novel VQA approach that exploits object attributes, aiming to achieve better object-level visual-language alignment and multimodal scene understanding. Specifically, we design an attribute fusion module and a contrastive knowledge distillation module. The attribute fusion module constructs a multimodal graph neural network to fuse attributes and visual features through message passing. The enhanced object-level visual features help solve fine-grained problems such as counting questions, while the improved object-level visual-language alignment aids multimodal scene understanding and thereby improves the model's robustness. Furthermore, to strengthen scene understanding and out-of-distribution performance, the contrastive knowledge distillation module introduces a series of implicit knowledge sources. We distill this knowledge into the attributes through a contrastive loss, which further strengthens the representation learning of attribute features and facilitates visual-language alignment. Extensive experiments on six datasets, COCO-QA, VQAv2, VQA-CPv2, VQA-CPv1, VQAvs and TDIUC, demonstrate the superiority of the proposed method.
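
To make the two modules concrete, the sketch below gives a minimal PyTorch rendering of the ideas the abstract describes. It is an illustrative approximation under our own assumptions, not the paper's implementation: the class `AttributeFusion`, the GRU-based node update, the bipartite object-attribute adjacency matrix `adj`, and the InfoNCE-style `contrastive_distill_loss` are all hypothetical choices for one plausible realization of attribute-to-object message passing and contrastive knowledge distillation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeFusion(nn.Module):
    """One round of message passing on a bipartite object-attribute graph.

    Hypothetical sketch: each object node aggregates messages from its
    linked attribute nodes and updates its visual feature with a GRU cell.
    """
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)      # transforms attribute features into messages
        self.update = nn.GRUCell(dim, dim)  # fuses aggregated messages into object features

    def forward(self, obj_feats, attr_feats, adj):
        # obj_feats: (N_obj, dim), attr_feats: (N_attr, dim)
        # adj: (N_obj, N_attr) binary matrix linking objects to their attributes
        weights = adj / adj.sum(dim=1, keepdim=True).clamp(min=1)  # mean over linked attributes
        messages = weights @ self.msg(attr_feats)                  # (N_obj, dim) aggregated messages
        return self.update(messages, obj_feats)                    # enhanced object-level features


def contrastive_distill_loss(attr_feats, teacher_feats, temperature=0.07):
    """InfoNCE-style loss pulling each attribute feature toward its matching
    teacher (implicit-knowledge) feature; a common contrastive formulation,
    not necessarily the paper's exact objective."""
    a = F.normalize(attr_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    logits = a @ t.T / temperature                       # (N, N) pairwise similarities
    labels = torch.arange(a.size(0), device=a.device)    # positives lie on the diagonal
    return F.cross_entropy(logits, labels)


# Toy usage: 5 objects, 8 attributes, 256-d features
obj = torch.randn(5, 256)
attr = torch.randn(8, 256)
adj = (torch.rand(5, 8) > 0.5).float()
fused = AttributeFusion(256)(obj, attr, adj)
loss = contrastive_distill_loss(attr, torch.randn(8, 256))  # stand-in teacher features
```

In a full model of this kind, the fused object features would feed the answer predictor and the distillation loss would be added to the VQA training objective; those wiring details are outside this sketch.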

Published

2024-03-24

How to Cite

Li, P., Si, Q., Fu, P., Lin, Z., & Wang, Y. (2024). Object Attribute Matters in Visual Question Answering. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17), 18545-18553. https://doi.org/10.1609/aaai.v38i17.29816

Issue

Vol. 38 No. 17 (2024)

Section

AAAI Technical Track on Natural Language Processing II