Detection-Based Intermediate Supervision for Visual Question Answering

Authors

  • Yuhang Liu: CCIIP Lab, School of Computer Science and Technology, Huazhong University of Science and Technology; Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL); ByteDance Inc.
  • Daowan Peng: CCIIP Lab, School of Computer Science and Technology, Huazhong University of Science and Technology; Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL)
  • Wei Wei: CCIIP Lab, School of Computer Science and Technology, Huazhong University of Science and Technology; Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL)
  • Yuanyuan Fu: Ping An Property & Casualty Insurance Company of China, Ltd; Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL)
  • Wenfeng Xie: Ping An Property & Casualty Insurance Company of China, Ltd; Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL)
  • Dangyang Chen: Ping An Property & Casualty Insurance Company of China, Ltd; Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL)

DOI:

https://doi.org/10.1609/aaai.v38i12.29315

Keywords:

ML: Multimodal Learning, CV: Language and Vision

Abstract

Recently, neural module networks (NMNs) have achieved sustained success in answering compositional visual questions, especially those involving multi-hop visual and logical reasoning. NMNs decompose a complex question into several sub-tasks using instance-modules derived from the question's reasoning path and then exploit intermediate supervisions to guide answer prediction, thereby improving inference interpretability. However, their performance may be hindered by coarse modeling of intermediate supervisions. For instance, (1) the prior assumption that each instance-module refers to only one grounded object overlooks other potentially associated grounded objects, impeding full cross-modal alignment learning; (2) IoU-based intermediate supervisions may introduce noisy signals, as overlapping bounding boxes might guide the model's focus towards irrelevant objects. To address these issues, a novel method, Detection-based Intermediate Supervision (DIS), is proposed, which adopts a generative detection framework to facilitate multiple grounding supervisions via sequence generation. As such, DIS offers more comprehensive and accurate intermediate supervisions, thereby boosting answer prediction performance. Furthermore, by considering intermediate results, DIS enhances consistency between the answers to compositional questions and the answers to their sub-questions. Extensive experiments demonstrate the superiority of the proposed DIS, showing both improved accuracy and state-of-the-art reasoning consistency compared to prior approaches.
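
The bounding box overlap issue mentioned in point (2) can be illustrated with a minimal sketch. It is not taken from the paper: the box coordinates, the proposal names, and the 0.5 threshold are all hypothetical, chosen only to show how an IoU threshold can mark a proposal belonging to a neighboring object as a positive supervision target.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical ground-truth box for the object an instance-module refers to.
target = (10, 10, 50, 50)

# Hypothetical region proposals: one covers the same object, the other
# covers a different object that happens to overlap the target box.
proposals = {
    "same_object": (12, 12, 50, 50),
    "neighbor_object": (10, 20, 50, 60),
}

THRESHOLD = 0.5  # assumed cutoff, for illustration only

scores = {name: iou(target, box) for name, box in proposals.items()}
positives = [name for name, score in scores.items() if score >= THRESHOLD]

for name, score in scores.items():
    print(f"{name}: IoU = {score:.4f}")
print("marked positive:", positives)
```

Under these assumed values both proposals exceed the threshold, so an IoU-based label would also direct attention to the neighboring object. This is the kind of noisy signal the abstract describes.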

Published

2024-03-24

How to Cite

Liu, Y., Peng, D., Wei, W., Fu, Y., Xie, W., & Chen, D. (2024). Detection-Based Intermediate Supervision for Visual Question Answering. Proceedings of the AAAI Conference on Artificial Intelligence, 38(12), 14061-14068. https://doi.org/10.1609/aaai.v38i12.29315

Issue

Section

AAAI Technical Track on Machine Learning III