Towards Robust Visual Question Answering via Prompt-Driven Geometric Harmonization

Authors

  • Yishu Liu, Harbin Institute of Technology Shenzhen, Shenzhen, China
  • Jiawei Zhu, Beijing Institute of Technology, Zhuhai, China
  • Congcong Wen, China Academy of Electronics and Information Technology, Beijing, China
  • Guangming Lu, Harbin Institute of Technology Shenzhen, Shenzhen, China
  • Hui Lin, China Academy of Electronics and Information Technology, Beijing, China
  • Bingzhi Chen, Beijing Institute of Technology, Zhuhai, China

DOI:

https://doi.org/10.1609/aaai.v39i6.32610

Abstract

Visual Question Answering (VQA) has garnered significant attention as a crucial link between vision and language, aiming to generate accurate responses to visual queries. However, current VQA models still struggle with minority-class collapse and spurious semantic correlations caused by language bias and imbalanced distributions. To address these challenges, this paper proposes a novel Prompt-Driven Geometric Harmonization (PDGH) paradigm, which integrates geometric-structure and information-entropy principles to help VQA models generalize effectively across diverse scenarios. Specifically, our PDGH approach is designed to produce image-generated prompts guided by specific question cues, facilitating a more accurate and context-aware understanding of the visual content. Moreover, we project the prompt-visual-question and visual-question joint representations into a unified hypersphere space, applying feature-weight self-orthogonality and prompt-information-entropy correction constraints to optimize the margin, further alleviating minority-class collapse and correcting language bias. To maintain the geometric integrity of the representation space, we introduce multi-space geometric contrast constraints that minimize the impact of spurious priors introduced during training. Finally, a semantic matrix is constructed for the coordinated joint representation to ensure that the learned instances are semantically consistent, improving reasoning ability. Extensive experiments on general and medical VQA datasets demonstrate the consistent superiority of our PDGH approach over existing state-of-the-art baselines.
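The hypersphere projection and feature-weight self-orthogonality constraint mentioned in the abstract can be sketched as below. This is a minimal illustrative sketch, not the paper's implementation: the function names and the concrete penalty form (the Frobenius-norm deviation of the weight Gram matrix from the identity) are assumptions chosen to convey the idea of pushing class prototypes apart to counter minority-class collapse.

```python
import numpy as np

def project_to_hypersphere(x, eps=1e-12):
    # L2-normalize joint representations onto the unit hypersphere,
    # so that similarity is governed by angles (margins) rather than norms.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def self_orthogonality_penalty(W):
    # Encourage classifier weight rows (class prototypes) to be mutually
    # orthogonal via || W W^T - I ||_F^2; spreading prototypes apart on the
    # sphere is one way to mitigate minority-class collapse.
    gram = W @ W.T
    return np.sum((gram - np.eye(W.shape[0])) ** 2)

# Toy usage with random features and prototypes.
rng = np.random.default_rng(0)
z = project_to_hypersphere(rng.normal(size=(4, 8)))  # 4 joint features on S^7
W = project_to_hypersphere(rng.normal(size=(3, 8)))  # 3 class prototypes
print(np.allclose(np.linalg.norm(z, axis=1), 1.0))   # features lie on the sphere
print(self_orthogonality_penalty(W) >= 0.0)          # penalty is non-negative
```

In practice such a penalty would be added to the task loss with a weighting coefficient; the entropy-correction and multi-space contrast terms from the abstract would enter the objective the same way.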

Published

2025-04-11

How to Cite

Liu, Y., Zhu, J., Wen, C., Lu, G., Lin, H., & Chen, B. (2025). Towards Robust Visual Question Answering via Prompt-Driven Geometric Harmonization. Proceedings of the AAAI Conference on Artificial Intelligence, 39(6), 5721–5729. https://doi.org/10.1609/aaai.v39i6.32610

Section

AAAI Technical Track on Computer Vision V