Visually Grounded Commonsense Knowledge Acquisition

Authors

  • Yuan Yao Department of Computer Science and Technology, Institute for Artificial Intelligence, Tsinghua University, Beijing, China
  • Tianyu Yu Tsinghua Shenzhen International Graduate School, Tsinghua University
  • Ao Zhang School of Computing, National University of Singapore, Singapore
  • Mengdi Li Department of Informatics, University of Hamburg, Hamburg, Germany
  • Ruobing Xie WeChat AI, Tencent
  • Cornelius Weber Department of Informatics, University of Hamburg, Hamburg, Germany
  • Zhiyuan Liu Department of Computer Science and Technology, Institute for Artificial Intelligence, Tsinghua University, Beijing, China
  • Hai-Tao Zheng Shenzhen International Graduate School, Tsinghua University; Peng Cheng Laboratory
  • Stefan Wermter Department of Informatics, University of Hamburg, Hamburg, Germany
  • Tat-Seng Chua School of Computing, National University of Singapore, Singapore
  • Maosong Sun Department of Computer Science and Technology, Institute for Artificial Intelligence, Tsinghua University, Beijing, China

DOI:

https://doi.org/10.1609/aaai.v37i5.25809

Keywords:

KRR: Knowledge Acquisition, KRR: Common-Sense Reasoning

Abstract

Large-scale commonsense knowledge bases empower a broad range of AI applications, making automatic commonsense knowledge extraction (CKE) a fundamental and challenging problem. CKE from text is known to suffer from the inherent sparsity and reporting bias of commonsense in text. Visual perception, on the other hand, contains rich commonsense knowledge about real-world entities, e.g., (person, can_hold, bottle), and can serve as a promising source for acquiring grounded commonsense knowledge. In this work, we present CLEVER, which formulates CKE as a distantly supervised multi-instance learning problem, where models learn to summarize commonsense relations from a bag of images about an entity pair without any human annotation on image instances. To address the problem, CLEVER leverages vision-language pre-training models for a deep understanding of each image in the bag, and selects informative instances from the bag to summarize commonsense entity relations via a novel contrastive attention mechanism. Comprehensive experimental results from held-out and human evaluations show that CLEVER extracts commonsense knowledge of promising quality, outperforming pre-trained language model-based methods by 3.9 AUC and 6.4 mAUC points. The predicted commonsense scores correlate strongly with human judgment, with a 0.78 Spearman coefficient. Moreover, the extracted commonsense can be grounded into images with reasonable interpretability. The data and code are available at https://github.com/thunlp/CLEVER.
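To make the multi-instance formulation in the abstract concrete, below is a minimal PyTorch sketch of bag-level relation scoring with selective attention over image instances. It follows the generic attention aggregator familiar from distantly supervised relation extraction rather than CLEVER's contrastive attention mechanism, and it assumes instance features have already been extracted by a vision-language model; all names and dimensions (BagAttention, feat_dim=512, num_relations=10) are illustrative, not taken from the paper or its repository.

```python
import torch
import torch.nn as nn


class BagAttention(nn.Module):
    """Scores candidate relations for one entity pair from a bag of image features."""

    def __init__(self, feat_dim: int, num_relations: int):
        super().__init__()
        # One attention query per relation, so each relation attends to
        # the instances that are most informative for it.
        self.query = nn.Parameter(torch.randn(num_relations, feat_dim))
        self.classifier = nn.Linear(feat_dim, num_relations)

    def forward(self, bag_feats: torch.Tensor) -> torch.Tensor:
        # bag_feats: (num_instances, feat_dim), one row per image of the pair.
        attn = torch.softmax(self.query @ bag_feats.t(), dim=-1)  # (R, N)
        bag_repr = attn @ bag_feats                               # (R, D)
        # Score each relation against its own attended bag representation.
        logits = (self.classifier.weight * bag_repr).sum(dim=-1) + self.classifier.bias
        return logits                                             # (R,)


# Usage: a bag of 8 hypothetical image features for (person, bottle);
# the output is one logit per candidate relation, e.g. can_hold.
model = BagAttention(feat_dim=512, num_relations=10)
bag_feats = torch.randn(8, 512)
print(model(bag_feats).shape)  # torch.Size([10])
```

Because supervision is only at the bag level (distant supervision from a knowledge base), the attention lets the model down-weight noisy images in which the entity pair does not actually exhibit the relation.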

Published

2023-06-26

How to Cite

Yao, Y., Yu, T., Zhang, A., Li, M., Xie, R., Weber, C., Liu, Z., Zheng, H.-T., Wermter, S., Chua, T.-S., & Sun, M. (2023). Visually Grounded Commonsense Knowledge Acquisition. Proceedings of the AAAI Conference on Artificial Intelligence, 37(5), 6583-6592. https://doi.org/10.1609/aaai.v37i5.25809

Issue

Vol. 37 No. 5 (2023)

Section

AAAI Technical Track on Knowledge Representation and Reasoning