Cross-Modal Feature Distribution Calibration for Few-Shot Visual Question Answering

Jing Zhang; Xiaoqiang Liu; Mingzhe Chen; Zhe Wang

doi:10.1609/aaai.v38i7.28543

Authors

Jing Zhang, Department of Computer Science and Engineering, East China University of Science and Technology, China
Xiaoqiang Liu, Department of Computer Science and Engineering, East China University of Science and Technology, China
Mingzhe Chen, Department of Computer Science and Engineering, East China University of Science and Technology, China
Zhe Wang, Department of Computer Science and Engineering, East China University of Science and Technology, China

DOI:

https://doi.org/10.1609/aaai.v38i7.28543

Keywords:

CV: Multi-modal Vision, CV: Language and Vision

Abstract

Few-shot Visual Question Answering (VQA) is an emerging and challenging computer vision task that requires cross-modal learning from only a few labeled examples. Most existing few-shot VQA methods simply extend few-shot classification methods to cross-modal tasks, ignoring both the spatial distribution properties of multimodal features and cross-modal information interaction. To address this problem, we propose a novel Cross-modal feature Distribution Calibration Inference Network (CDCIN), which introduces a new concept, visual information entropy, to calibrate the distribution of multimodal features through cross-modal information interaction for more effective few-shot VQA. Visual information entropy is a statistical variable that represents the spatial distribution of visual features guided by the question. Our proposed visual information entropy calibration module aligns this variable before and after the reasoning process to mitigate redundant information and improve the multimodal features. To further enhance the inference ability of cross-modal features, we additionally propose a novel pre-training method, in which the reasoning sub-network of CDCIN is pre-trained on the base classes in a VQA classification paradigm and then fine-tuned on the few-shot VQA datasets. Extensive experiments on three widely used benchmark datasets demonstrate that CDCIN performs strongly on few-shot VQA and outperforms state-of-the-art methods.
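
The abstract does not give a formal definition of visual information entropy, so the following is only an illustrative reading and not the authors' formulation. It assumes that "the spatial distribution of visual features guided by the question" can be approximated by a softmax attention over image regions, and that the statistic of interest is the Shannon entropy of that attention. The function names (question_guided_attention, attention_entropy, entropy_gap) and the dot-product scoring are hypothetical choices made for this sketch.

```python
import math


def question_guided_attention(region_feats, question_vec):
    """Softmax over dot-product scores between each region feature and the question.

    Assumption: dot-product scoring stands in for whatever cross-modal
    interaction the actual model uses.
    """
    scores = [sum(r * q for r, q in zip(region, question_vec))
              for region in region_feats]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(weights):
    """Shannon entropy (in nats) of an attention distribution over regions."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def entropy_gap(feats_before, feats_after, question_vec):
    """Absolute entropy difference before and after a reasoning step.

    Assumption: a calibration objective could penalize this gap; the
    paper's actual alignment criterion is not stated in the abstract.
    """
    h_before = attention_entropy(question_guided_attention(feats_before, question_vec))
    h_after = attention_entropy(question_guided_attention(feats_after, question_vec))
    return abs(h_before - h_after)


if __name__ == "__main__":
    question = [1.0, 0.0]
    before = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # attention spread evenly
    after = [[10.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # attention focused on one region
    print(entropy_gap(before, after, question))
```

Under this reading, a low entropy means the question concentrates attention on a few regions, while a high entropy means attention is spread across the image. A calibration step would then try to keep the two entropies consistent across the reasoning process.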
