Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations

Yufeng Huang; Jiji Tang; Zhuo Chen; Rongsheng Zhang; Xinfeng Zhang; Weijie Chen; Zeng Zhao; Zhou Zhao; Tangjie Lv; Zhipeng Hu; Wen Zhang

doi:10.1609/aaai.v38i3.28017

Authors

Yufeng Huang (Zhejiang University)
Jiji Tang (Fuxi AI Lab, Netease Inc.)
Zhuo Chen (Zhejiang University)
Rongsheng Zhang (Fuxi AI Lab, Netease Inc.; Zhejiang University)
Xinfeng Zhang (Fuxi AI Lab, Netease Inc.)
Weijie Chen (Fuxi AI Lab, Netease Inc.)
Zeng Zhao (Fuxi AI Lab, Netease Inc.)
Zhou Zhao (Zhejiang University)
Tangjie Lv (Fuxi AI Lab, Netease Inc.)
Zhipeng Hu (Fuxi AI Lab, Netease Inc.)
Wen Zhang (Zhejiang University)

Keywords:

CV: Multi-modal Vision, CV: Language and Vision, DMKM: Mining of Visual, Multimedia & Multimodal Data

Abstract

Large-scale vision-language pre-training has achieved strong performance on multi-modal understanding and generation tasks. However, existing methods often perform poorly on image-text matching tasks that require structured representations, i.e., representations of objects, attributes, and relations. For example, such models cannot distinguish between "An astronaut rides a horse" and "A horse rides an astronaut", because they fail to fully leverage structured knowledge when learning multi-modal representations. In this paper, we present Structure-CLIP, an end-to-end framework that integrates Scene Graph Knowledge (SGK) to enhance multi-modal structured representations. First, we use scene graphs to guide the construction of semantic negative examples, which places greater emphasis on learning structured representations. Second, we propose a Knowledge-Enhance Encoder (KEE) that takes SGK as input to further enhance structured representations. To verify the effectiveness of the proposed framework, we pre-train our model with these approaches and conduct experiments on downstream tasks. Experimental results demonstrate that Structure-CLIP achieves state-of-the-art (SOTA) performance on the VG-Attribution and VG-Relation datasets, outperforming the multi-modal SOTA model by 12.5% and 4.1%, respectively. Meanwhile, results on MSCOCO indicate that Structure-CLIP significantly enhances structured representations while maintaining general representation ability. Our code is available at https://github.com/zjukg/Structure-CLIP.
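
The idea of a scene-graph-guided semantic negative can be illustrated with a small sketch. It is not taken from the Structure-CLIP code base: the names Triple and swap_negative are hypothetical, the (subject, relation, object) triple is assumed to have been extracted already by some scene graph parser, and exchanging subject and object is only one plausible way to build such a negative. Article agreement and capitalization are ignored.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    """Hypothetical (subject, relation, object) scene-graph triple."""
    subject: str
    relation: str
    obj: str


def swap_negative(caption: str, triple: Triple) -> Optional[str]:
    """Exchange subject and object in the caption.

    Returns None when the subject is missing, when the object does not
    occur after the subject, or when the swap would change nothing.
    """
    s, o = triple.subject, triple.obj
    if s == o:
        return None
    i = caption.find(s)
    if i == -1:
        return None
    # Look for the object only after the end of the subject.
    j = caption.find(o, i + len(s))
    if j == -1:
        return None
    middle = caption[i + len(s):j]  # text between the two nodes
    return caption[:i] + o + middle + s + caption[j + len(o):]


triple = Triple("astronaut", "rides", "horse")
positive = "the astronaut rides the horse"
negative = swap_negative(positive, triple)
print(negative)  # same words as the positive, different structure
```

The positive and negative captions contain exactly the same words, so a model can only separate them by attending to who does what to whom, which is the kind of structured distinction the abstract describes.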
