Cross-Modal Feature Distribution Calibration for Few-Shot Visual Question Answering
DOI:
https://doi.org/10.1609/aaai.v38i7.28543
Keywords:
CV: Multi-modal Vision, CV: Language and Vision
Abstract
Few-shot Visual Question Answering (VQA) realizes few-shot cross-modal learning, which is an emerging and challenging task in computer vision. Currently, most few-shot VQA methods simply extend few-shot classification methods to cross-modal tasks while ignoring the spatial distribution properties of multimodal features and cross-modal information interaction. To address this problem, we propose a novel Cross-modal feature Distribution Calibration Inference Network (CDCIN), which introduces a new concept, visual information entropy, to calibrate the distribution of multimodal features through cross-modal information interaction for more effective few-shot VQA. Visual information entropy is a statistical variable that represents the spatial distribution of visual features guided by the question. Our proposed visual information entropy calibration module aligns it before and after the reasoning process to mitigate redundant information and improve the multimodal features. To further enhance the inference ability of cross-modal features, we additionally propose a novel pre-training method, where the reasoning sub-network of CDCIN is pre-trained on the base class in a VQA classification paradigm and fine-tuned on the few-shot VQA datasets. Extensive experiments demonstrate that our proposed CDCIN achieves excellent performance on few-shot VQA and outperforms state-of-the-art methods on three widely used benchmark datasets.
Published
2024-03-24
How to Cite
Zhang, J., Liu, X., Chen, M., & Wang, Z. (2024). Cross-Modal Feature Distribution Calibration for Few-Shot Visual Question Answering. Proceedings of the AAAI Conference on Artificial Intelligence, 38(7), 7151–7159. https://doi.org/10.1609/aaai.v38i7.28543
Issue
Section
AAAI Technical Track on Computer Vision VI