SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning


  • Zhecan Wang Columbia University
  • Haoxuan You Columbia University
  • Liunian Harold Li University of California, Los Angeles
  • Alireza Zareian Columbia University
  • Suji Park Columbia University
  • Yiqing Liang Columbia University
  • Kai-Wei Chang University of California, Los Angeles
  • Shih-Fu Chang Columbia University



Knowledge Representation And Reasoning (KRR), Computer Vision (CV), Machine Learning (ML), Cognitive Modeling & Cognitive Systems (CMS)


Answering complex questions about images is an ambitious goal for machine intelligence: it requires a joint understanding of images, text, and commonsense knowledge, as well as strong reasoning ability. Recently, multimodal Transformers have made great progress on the task of Visual Commonsense Reasoning (VCR) by jointly understanding visual objects and text tokens through layers of cross-modality attention. However, these approaches do not utilize the rich structure of the scene or the interactions between objects, both of which are essential for answering complex commonsense questions. We propose a Scene Graph Enhanced Image-Text Learning (SGEITL) framework that incorporates visual scene graphs into commonsense reasoning. To exploit the scene graph structure at the model level, we propose a multihop graph Transformer that regularizes attention interaction among hops. For pre-training, we propose a scene-graph-aware method that leverages the structural knowledge extracted from the visual scene graph. Moreover, we introduce a method to train and generate domain-relevant visual scene graphs using textual annotations in a weakly supervised manner. Extensive experiments on VCR and other tasks show a significant performance boost over state-of-the-art methods and demonstrate the efficacy of each proposed component.
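
As a rough illustration of what restricting attention by hop distance over a scene graph could look like, the sketch below builds a boolean attention mask in which a node may attend only to nodes within a given number of hops. This is a generic example and not the SGEITL implementation: the abstract does not specify how hops regularize attention, and the function names (`hop_distances`, `hop_attention_mask`), the undirected treatment of edges, and the toy graph are all assumptions made here for illustration.

```python
from collections import deque


def hop_distances(num_nodes, edges):
    """All-pairs hop counts via BFS; None marks unreachable pairs.

    Assumption: edges are treated as undirected for reachability.
    """
    neighbors = [[] for _ in range(num_nodes)]
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    dist = [[None] * num_nodes for _ in range(num_nodes)]
    for src in range(num_nodes):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if dist[src][nxt] is None:
                    dist[src][nxt] = dist[src][node] + 1
                    queue.append(nxt)
    return dist


def hop_attention_mask(num_nodes, edges, max_hops):
    """mask[i][j] is True when node j lies within max_hops of node i."""
    dist = hop_distances(num_nodes, edges)
    return [[d is not None and d <= max_hops for d in row] for row in dist]


# Hypothetical toy scene graph, with object and relation nodes in a chain:
# person(0) - holding(1) - cup(2) - on(3) - table(4)
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
mask_1hop = hop_attention_mask(5, edges, max_hops=1)
mask_2hop = hop_attention_mask(5, edges, max_hops=2)
```

In a Transformer layer, such a mask would typically be applied by setting the attention logits of disallowed pairs to a large negative value before the softmax; widening `max_hops` from layer to layer is one possible way to let information propagate progressively across the graph.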




How to Cite

Wang, Z., You, H., Li, L. H., Zareian, A., Park, S., Liang, Y., Chang, K.-W., & Chang, S.-F. (2022). SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning. Proceedings of the AAAI Conference on Artificial Intelligence, 36(5), 5914-5922.



AAAI Technical Track on Knowledge Representation and Reasoning