Cross-Modal Feature Distribution Calibration for Few-Shot Visual Question Answering

Authors

  • Jing Zhang, Department of Computer Science and Engineering, East China University of Science and Technology, China
  • Xiaoqiang Liu, Department of Computer Science and Engineering, East China University of Science and Technology, China
  • Mingzhe Chen, Department of Computer Science and Engineering, East China University of Science and Technology, China
  • Zhe Wang, Department of Computer Science and Engineering, East China University of Science and Technology, China

DOI:

https://doi.org/10.1609/aaai.v38i7.28543

Keywords:

CV: Multi-modal Vision, CV: Language and Vision

Abstract

Few-shot Visual Question Answering (VQA) is an emerging and challenging computer vision task that requires few-shot learning across modalities. Most existing few-shot VQA methods simply extend few-shot classification methods to cross-modal tasks, ignoring both the spatial distribution properties of multimodal features and cross-modal information interaction. To address this problem, we propose a novel Cross-modal feature Distribution Calibration Inference Network (CDCIN). It introduces a new concept, visual information entropy, which calibrates the distribution of multimodal features through cross-modal information interaction for more effective few-shot VQA. Visual information entropy is a statistical variable that represents the spatial distribution of visual features guided by the question. Our proposed visual information entropy calibration module aligns this variable before and after the reasoning process to mitigate redundant information and improve the multimodal features. To further enhance the inference ability of cross-modal features, we also propose a novel pre-training method in which the reasoning sub-network of CDCIN is pre-trained on the base classes in a VQA classification paradigm and then fine-tuned on the few-shot VQA datasets. Extensive experiments demonstrate that CDCIN outperforms state-of-the-art methods on three widely used benchmark datasets.
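
The abstract does not give a formula for visual information entropy, so the following is only an illustrative sketch of one plausible reading, not the authors' definition. It assumes that the question induces a softmax attention distribution over image-region features and that the statistic is the Shannon entropy of that distribution. The function names (`question_guided_entropy`, `entropy_gap`), the dot-product scoring, and the toy vectors are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def question_guided_entropy(region_features, question_vector):
    """Shannon entropy (in nats) of question-guided attention over regions.

    Assumption: attention weights are a softmax over dot-product scores.
    This is an illustrative stand-in, not the paper's stated formula.
    """
    scores = [dot(r, question_vector) for r in region_features]
    weights = softmax(scores)
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def entropy_gap(entropy_before, entropy_after):
    """Hypothetical alignment penalty: absolute difference of the statistic
    measured before and after a reasoning step."""
    return abs(entropy_before - entropy_after)


# Hypothetical toy inputs: four image regions with 2-d features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
uninformative_q = [0.0, 0.0]   # scores every region equally
focused_q = [8.0, 8.0]         # strongly prefers the third region

uniform_entropy = question_guided_entropy(regions, uninformative_q)
focused_entropy = question_guided_entropy(regions, focused_q)
gap = entropy_gap(uniform_entropy, focused_entropy)
```

Under these assumptions, a question that does not discriminate between regions yields the maximum entropy, log of the number of regions, while a question that singles out one region yields a value close to zero. A calibration objective in this spirit could penalize a large gap between the two measurements, although how CDCIN actually performs the alignment is specified in the paper itself, not in this sketch.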

Published

2024-03-24

How to Cite

Zhang, J., Liu, X., Chen, M., & Wang, Z. (2024). Cross-Modal Feature Distribution Calibration for Few-Shot Visual Question Answering. Proceedings of the AAAI Conference on Artificial Intelligence, 38(7), 7151–7159. https://doi.org/10.1609/aaai.v38i7.28543

Section

AAAI Technical Track on Computer Vision VI