Towards Robust Visual Question Answering via Prompt-Driven Geometric Harmonization

Authors

  • Yishu Liu, Harbin Institute of Technology Shenzhen, Shenzhen, China
  • Jiawei Zhu, Beijing Institute of Technology, Zhuhai, China
  • Congcong Wen, China Academy of Electronics and Information Technology, Beijing, China
  • Guangming Lu, Harbin Institute of Technology Shenzhen, Shenzhen, China
  • Hui Lin, China Academy of Electronics and Information Technology, Beijing, China
  • Bingzhi Chen, Beijing Institute of Technology, Zhuhai, China

DOI:

https://doi.org/10.1609/aaai.v39i6.32610

Abstract

Visual Question Answering (VQA) has garnered significant attention as a crucial link between vision and language, aiming to generate accurate responses to visual queries. However, current VQA models still struggle with minority-class collapse and spurious semantic correlations caused by language bias and imbalanced distributions. To address these challenges, this paper proposes a novel Prompt-Driven Geometric Harmonization (PDGH) paradigm, which integrates geometric-structure and information-entropy principles to help VQA models generalize effectively across diverse scenarios. Specifically, our PDGH approach is designed to produce image-generated prompts guided by specific question cues, facilitating a more accurate and context-aware understanding of the visual content. Moreover, we project the prompt-visual-question and visual-question joint representations into a unified hypersphere space, applying feature-weight self-orthogonality and prompt-information-entropy correction constraints to optimize the margin, further alleviating minority-class collapse and correcting language bias. To maintain the geometric integrity of the representation space, we introduce multi-space geometric contrast constraints that minimize the impact of spurious priors introduced during training. Finally, a semantic matrix is constructed for the coordinated joint representation to ensure that the learned instances are semantically consistent, improving reasoning ability. Extensive experiments on general and medical VQA datasets demonstrate the consistent superiority of our PDGH approach over existing state-of-the-art baselines.
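The hypersphere projection and feature-weight self-orthogonality constraint mentioned in the abstract can be sketched as below. This is a minimal illustrative sketch, not the paper's implementation: the function names and the concrete penalty form (the Frobenius-norm deviation of the weight Gram matrix from the identity) are assumptions chosen to convey the idea of pushing class prototypes apart to counter minority-class collapse.

```python
import numpy as np

def project_to_hypersphere(x, eps=1e-12):
    # L2-normalize joint representations onto the unit hypersphere,
    # so that similarity is governed by angles (margins) rather than norms.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def self_orthogonality_penalty(W):
    # Encourage classifier weight rows (class prototypes) to be mutually
    # orthogonal via || W W^T - I ||_F^2; spreading prototypes apart on the
    # sphere is one way to mitigate minority-class collapse.
    gram = W @ W.T
    return np.sum((gram - np.eye(W.shape[0])) ** 2)

# Toy usage with random features and prototypes.
rng = np.random.default_rng(0)
z = project_to_hypersphere(rng.normal(size=(4, 8)))  # 4 joint features on S^7
W = project_to_hypersphere(rng.normal(size=(3, 8)))  # 3 class prototypes
print(np.allclose(np.linalg.norm(z, axis=1), 1.0))   # features lie on the sphere
print(self_orthogonality_penalty(W) >= 0.0)          # penalty is non-negative
```

In practice such a penalty would be added to the task loss with a weighting coefficient; the entropy-correction and multi-space contrast terms from the abstract would enter the objective the same way.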

Published

2025-04-11

How to Cite

Liu, Y., Zhu, J., Wen, C., Lu, G., Lin, H., & Chen, B. (2025). Towards Robust Visual Question Answering via Prompt-Driven Geometric Harmonization. Proceedings of the AAAI Conference on Artificial Intelligence, 39(6), 5721–5729. https://doi.org/10.1609/aaai.v39i6.32610

Section

AAAI Technical Track on Computer Vision V