Zero-Shot Vision Language Reasoning via Dual-layer Scene Graph Chain of Thoughts (Student Abstract)
DOI:
https://doi.org/10.1609/aaai.v40i48.42188Abstract
Large Multimodal Models (LMMs) often hallucinate objects and struggle with compositional reasoning in complex visual scenes. Structured Scene Graph (SG) representations explicitly encoding objects, attributes, and relations can mitigate these issues, however finetuning risks catastrophic forgetting. Recent zero-shot approaches prompt LMMs with scene graphs, yet typically rely on a single SG generated in one step, limiting capture of holistic context and question-specific details. We introduce a Dual-Layer Scene Graph Chain-of-Thought DLSG-CoT framework that enriches reasoning by combining two structured SGs: a Global Scene Graph (G-SG) that offers comprehensive image context, and a Query-Specific Scene Graph (Q-SG) produced through a two-step process targeting information relevant to the input query. Extensive experiments demonstrate that DLSG-CoT substantially improves LMM performance on compositional and context-sensitive tasks.Downloads
Published
2026-03-14
How to Cite
Bansal, Y., Kapoor, P., & Pandey, A. (2026). Zero-Shot Vision Language Reasoning via Dual-layer Scene Graph Chain of Thoughts (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 40(48), 41132–41133. https://doi.org/10.1609/aaai.v40i48.42188
Issue
Section
AAAI Student Abstract and Poster Program