Zero-Shot Vision Language Reasoning via Dual-layer Scene Graph Chain of Thoughts (Student Abstract)

Authors

  • Yash Bansal, Indian Institute of Technology, Roorkee
  • Parshiv Kapoor, Indian Institute of Technology, Roorkee
  • Agam Pandey, Indian Institute of Technology, Roorkee

DOI:

https://doi.org/10.1609/aaai.v40i48.42188

Abstract

Large Multimodal Models (LMMs) often hallucinate objects and struggle with compositional reasoning in complex visual scenes. Structured Scene Graph (SG) representations, which explicitly encode objects, attributes, and relations, can mitigate these issues; however, finetuning risks catastrophic forgetting. Recent zero-shot approaches prompt LMMs with scene graphs, yet they typically rely on a single SG generated in one step, which limits their capture of both holistic context and question-specific details. We introduce a Dual-Layer Scene Graph Chain-of-Thought (DLSG-CoT) framework that enriches reasoning by combining two structured SGs: a Global Scene Graph (G-SG) that offers comprehensive image context, and a Query-Specific Scene Graph (Q-SG) produced through a two-step process that targets information relevant to the input query. Extensive experiments demonstrate that DLSG-CoT substantially improves LMM performance on compositional and context-sensitive tasks.
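
The prompting flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `LMM` callable interface, the function names, and all prompt wording are hypothetical, and because the abstract does not say what the two steps of Q-SG generation are, the sketch assumes they are (1) selecting query-relevant objects and (2) building a scene graph restricted to them.

```python
from typing import Any, Callable

# Hypothetical interface: an LMM is any callable mapping (image, prompt) to text.
LMM = Callable[[Any, str], str]


def global_scene_graph(lmm: LMM, image: Any) -> str:
    """Ask the model for a scene graph covering the whole image (G-SG)."""
    prompt = (
        "Describe the entire image as a scene graph in JSON, "
        "listing objects, their attributes, and the relations between them."
    )
    return lmm(image, prompt)


def query_scene_graph(lmm: LMM, image: Any, question: str) -> str:
    """Build a query-specific scene graph (Q-SG) in two steps.

    The split into these two steps is an assumption made for illustration.
    """
    # Step 1 (assumed): find the objects that matter for this question.
    focus = lmm(
        image,
        f"Question: {question}\n"
        "List the objects in the image that are relevant to this question.",
    )
    # Step 2 (assumed): generate a scene graph restricted to those objects.
    return lmm(
        image,
        f"Question: {question}\n"
        f"Relevant objects: {focus}\n"
        "Generate a JSON scene graph covering only these objects, "
        "with their attributes and relations.",
    )


def dlsg_cot_answer(lmm: LMM, image: Any, question: str) -> str:
    """Answer a question zero-shot by prompting with both scene graphs."""
    g_sg = global_scene_graph(lmm, image)
    q_sg = query_scene_graph(lmm, image, question)
    prompt = (
        f"Global scene graph:\n{g_sg}\n\n"
        f"Query-specific scene graph:\n{q_sg}\n\n"
        f"Using the image and both scene graphs, answer: {question}"
    )
    return lmm(image, prompt)
```

In this sketch each question costs four model calls (one for the G-SG, two for the Q-SG, one for the final answer), and no model weights are updated, which is consistent with the zero-shot setting the abstract describes.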

Published

2026-03-14

How to Cite

Bansal, Y., Kapoor, P., & Pandey, A. (2026). Zero-Shot Vision Language Reasoning via Dual-layer Scene Graph Chain of Thoughts (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 40(48), 41132–41133. https://doi.org/10.1609/aaai.v40i48.42188