Causal Inference over Visual-Semantic-Aligned Graph for Image Classification

Lei Meng; Xiangxian Li; Xiaoshuo Yan; Haokai Ma; Zhuang Qi; Wei Wu; Xiangxu Meng

doi:10.1609/aaai.v39i18.34141

Authors

Lei Meng: School of Software, Shandong University, Jinan, China; Shandong Research Institute of Industrial Technology, Jinan, China
Xiangxian Li: School of Software, Shandong University, Jinan, China; School of Mechanical, Electrical & Information Engineering, Shandong University, Weihai, China
Xiaoshuo Yan: School of Software, Shandong University, Jinan, China
Haokai Ma: School of Software, Shandong University, Jinan, China
Zhuang Qi: School of Software, Shandong University, Jinan, China
Wei Wu: School of Software, Shandong University, Jinan, China
Xiangxu Meng: School of Software, Shandong University, Jinan, China

DOI:

https://doi.org/10.1609/aaai.v39i18.34141

Abstract

Incorporating tagging information to regularize the representation learning of images usually improves image classification by aligning the visual features with textual features of higher discriminative power. Existing methods typically follow the predictive approach, which uses tags as the semantic labels for visual input to make predictions. However, they typically struggle to handle the heterogeneity between modalities. To learn an accurate visual-semantic mapping, this paper presents a visual-semantic causal association modeling framework termed VSCNet. It aligns visual regions with tags, uses a pre-learned hierarchy of visual and semantic exemplars to refine tag predictions, and constructs an augmented heterogeneous graph to perform causal intervention. Specifically, the fine-grained visual-semantic alignment (FVA) module adaptively locates the semantic-intensive regions corresponding to tags. The heterogeneous association refinement (HAR) module associates the visual regions, semantic elements, and pre-learned visual prototypes in a heterogeneous graph to filter out erroneous predictions and enrich the information. The causal inference with graphical masking (CIM) module applies self-learned masks to discover the causal nodes and edges in the heterogeneous graph, addressing spurious associations and forming robust causal representations. Experimental results on two benchmark datasets show that VSCNet effectively builds visual-semantic associations from images and, with the enriched predictive information, outperforms state-of-the-art methods.