Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations

Authors

  • Yufeng Huang (Zhejiang University)
  • Jiji Tang (Fuxi AI Lab, Netease Inc.)
  • Zhuo Chen (Zhejiang University)
  • Rongsheng Zhang (Fuxi AI Lab, Netease Inc.; Zhejiang University)
  • Xinfeng Zhang (Fuxi AI Lab, Netease Inc.)
  • Weijie Chen (Fuxi AI Lab, Netease Inc.)
  • Zeng Zhao (Fuxi AI Lab, Netease Inc.)
  • Zhou Zhao (Zhejiang University)
  • Tangjie Lv (Fuxi AI Lab, Netease Inc.)
  • Zhipeng Hu (Fuxi AI Lab, Netease Inc.)
  • Wen Zhang (Zhejiang University)

DOI:

https://doi.org/10.1609/aaai.v38i3.28017

Keywords:

CV: Multi-modal Vision, CV: Language and Vision, DMKM: Mining of Visual, Multimedia & Multimodal Data

Abstract

Large-scale vision-language pre-training has achieved strong performance in multi-modal understanding and generation tasks. However, existing methods often perform poorly on image-text matching tasks that require structured representations, i.e., representations of objects, attributes, and relations. For example, such models cannot distinguish between "An astronaut rides a horse" and "A horse rides an astronaut". This is because they fail to fully leverage structured knowledge when learning multi-modal representations. In this paper, we present Structure-CLIP, an end-to-end framework that integrates Scene Graph Knowledge (SGK) to enhance multi-modal structured representations. First, we use scene graphs to guide the construction of semantic negative examples, which places greater emphasis on learning structured representations. Second, we propose a Knowledge-Enhance Encoder (KEE) that takes SGK as input to further enhance structured representations. To verify the effectiveness of the proposed framework, we pre-train our model with both approaches and conduct experiments on downstream tasks. Experimental results demonstrate that Structure-CLIP achieves state-of-the-art (SOTA) performance on the VG-Attribution and VG-Relation datasets, outperforming the multi-modal SOTA model by 12.5% and 4.1%, respectively. Meanwhile, the results on MSCOCO indicate that Structure-CLIP significantly enhances structured representations while maintaining the ability to learn general representations. Our code is available at https://github.com/zjukg/Structure-CLIP.
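
The idea of scene-graph-guided negative examples can be illustrated with a minimal sketch. It is not taken from the authors' code (see the repository linked above for that): the `Triple` type and the `swap_negative` helper are hypothetical names introduced here, and the sketch assumes a lower-cased, whitespace-tokenized caption in which the subject and object each appear as a single word. It swaps the subject and object of one (subject, relation, object) triple, so the negative caption contains the same words as the original but describes a different structure.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One (subject, relation, object) edge of a scene graph (hypothetical type)."""
    subject: str
    relation: str
    obj: str


def swap_negative(caption: str, triple: Triple) -> str:
    """Build a negative caption by swapping the triple's subject and object.

    Assumption: the caption is whitespace-tokenized and both the subject and
    the object occur in it as single tokens. Article agreement ("a"/"an")
    and capitalization are deliberately ignored in this sketch.
    """
    tokens = caption.split()
    if triple.subject not in tokens or triple.obj not in tokens:
        raise ValueError("subject and object must both appear in the caption")

    # Locate the first occurrence of each word and exchange them.
    i = tokens.index(triple.subject)
    j = tokens.index(triple.obj)
    tokens[i], tokens[j] = tokens[j], tokens[i]

    negative = " ".join(tokens)
    if negative == caption:
        # Swapping identical words would not change the meaning.
        raise ValueError("swap did not change the caption")
    return negative


if __name__ == "__main__":
    triple = Triple("astronaut", "rides", "horse")
    print(swap_negative("the astronaut rides the horse", triple))
```

A negative built this way shares its bag of words with the positive caption, so a model can only tell the two apart by attending to who does what to whom, which is the kind of structured distinction the abstract describes.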

Published

2024-03-24

How to Cite

Huang, Y., Tang, J., Chen, Z., Zhang, R., Zhang, X., Chen, W., … Zhang, W. (2024). Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations. Proceedings of the AAAI Conference on Artificial Intelligence, 38(3), 2417–2425. https://doi.org/10.1609/aaai.v38i3.28017

Issue

Vol. 38 No. 3 (2024)

Section

AAAI Technical Track on Computer Vision II