Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models

Quang-Hung Le; Long Hoang Dang; Ngan Hoang Le; Truyen Tran; Thao Minh Le

doi:10.1609/aaai.v39i4.32471

Authors

Quang-Hung Le, Applied Artificial Intelligence Institute, Deakin University
Long Hoang Dang, Posts & Telecommunications Institute of Technology
Ngan Hoang Le, University of Arkansas, Fayetteville
Truyen Tran, Applied Artificial Intelligence Institute, Deakin University
Thao Minh Le, Applied Artificial Intelligence Institute, Deakin University

Abstract

Existing Large Vision-Language Models (LVLMs) excel at matching concepts across multi-modal inputs but struggle with compositional concepts and high-level relationships between entities. This paper introduces Progressive multi-granular Vision-Language alignments (PromViL), a novel framework that enhances LVLMs' ability to perform grounded compositional visual reasoning tasks. Our approach constructs a hierarchical structure of multi-modal alignments, ranging from simple to complex concepts. By progressively aligning textual descriptions with their corresponding visual regions, our model learns to leverage contextual information from lower levels to inform higher-level reasoning. To facilitate this learning process, we introduce a data generation process that creates a novel dataset derived from Visual Genome, providing a wide range of nested compositional vision-language pairs. Experimental results demonstrate that our PromViL framework significantly outperforms baselines on various visual grounding and compositional question answering tasks.
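
To make the idea of nested compositional vision-language pairs concrete, the following is a minimal Python sketch of one way such a hierarchy could be represented and traversed from simple to complex concepts. It is not the authors' implementation or dataset format: the names `AlignedPhrase`, `enclosing_box` and `progressive_order`, the example phrases, and the box coordinates are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class AlignedPhrase:
    """A text span paired with an image region, plus the simpler pairs nested inside it."""
    text: str
    box: Box
    parts: List["AlignedPhrase"] = field(default_factory=list)

    def level(self) -> int:
        """0 for a simple phrase, otherwise 1 + the level of the deepest nested part."""
        if not self.parts:
            return 0
        return 1 + max(p.level() for p in self.parts)


def enclosing_box(boxes: List[Box]) -> Box:
    """Smallest box covering all given boxes (an assumed rule for composite regions)."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def compose(text: str, parts: List[AlignedPhrase]) -> AlignedPhrase:
    """Build a higher-level pair whose region covers the regions of its parts."""
    return AlignedPhrase(text, enclosing_box([p.box for p in parts]), parts)


def flatten(node: AlignedPhrase) -> Iterator[AlignedPhrase]:
    """Yield the node and every pair nested under it (pre-order)."""
    yield node
    for part in node.parts:
        yield from flatten(part)


def progressive_order(root: AlignedPhrase) -> List[AlignedPhrase]:
    """Order all pairs from the simplest level to the most complex one."""
    return sorted(flatten(root), key=lambda n: n.level())


# Illustrative example; the phrases and coordinates are made up.
man = AlignedPhrase("man", (10, 20, 110, 220))
umbrella = AlignedPhrase("umbrella", (60, 5, 160, 90))
car = AlignedPhrase("car", (170, 120, 300, 220))
man_with_umbrella = compose("man holding umbrella", [man, umbrella])
scene = compose("man holding umbrella next to car", [man_with_umbrella, car])

for pair in progressive_order(scene):
    print(pair.level(), pair.text, pair.box)
```

In this sketch the simple phrases are visited first and the full expression last, which mirrors the simple-to-complex ordering the abstract describes; how PromViL actually uses lower-level alignments during training and inference is specified in the paper, not here.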
