Causal Inference over Visual-Semantic-Aligned Graph for Image Classification

Authors

  • Lei Meng (School of Software, Shandong University, Jinan, China; Shandong Research Institute of Industrial Technology, Jinan, China)
  • Xiangxian Li (School of Software, Shandong University, Jinan, China; School of Mechanical, Electrical & Information Engineering, Shandong University, Weihai, China)
  • Xiaoshuo Yan (School of Software, Shandong University, Jinan, China)
  • Haokai Ma (School of Software, Shandong University, Jinan, China)
  • Zhuang Qi (School of Software, Shandong University, Jinan, China)
  • Wei Wu (School of Software, Shandong University, Jinan, China)
  • Xiangxu Meng (School of Software, Shandong University, Jinan, China)

DOI:

https://doi.org/10.1609/aaai.v39i18.34141

Abstract

Incorporating tagging information to regularize the representation learning of images usually improves image classification performance by aligning visual features with textual features of higher discriminative power. Existing methods typically follow a predictive approach, using tags as semantic labels for the visual input when making predictions. However, they struggle to handle the heterogeneity between modalities. To learn an accurate visual-semantic mapping, this paper presents VSCNet, a visual-semantic causal association modeling framework. It aligns visual regions with tags, uses a pre-learned hierarchy of visual and semantic exemplars to refine tag predictions, and constructs an augmented heterogeneous graph on which to perform causal intervention. Specifically, the fine-grained visual-semantic alignment (FVA) module adaptively locates the semantic-intensive regions corresponding to tags. The heterogeneous association refinement (HAR) module associates visual regions, semantic elements, and pre-learned visual prototypes in a heterogeneous graph to filter out erroneous predictions and enrich the information. The causal inference with graphical masking (CIM) module applies self-learned masks to discover the causal nodes and edges in the heterogeneous graph and suppress spurious associations, forming robust causal representations. Experimental results on two benchmark datasets show that VSCNet effectively builds visual-semantic associations from images and, with the enriched predictive information, outperforms state-of-the-art methods.
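
The masking idea behind the CIM module can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how the masks are parameterized, so the code below assumes sigmoid-gated logits for soft node and edge masks and a single round of degree-normalized message passing. The function name `masked_graph_readout` and all parameter names are hypothetical, and in a real model the logits would be learned by gradient descent rather than set by hand.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, mapping logits to soft mask values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def masked_graph_readout(features, adjacency, node_logits, edge_logits):
    """Aggregate a graph into one vector using soft node and edge masks.

    Assumed (hypothetical) parameterization: each node and each edge has a
    logit; its sigmoid acts as a soft keep/drop mask.
    """
    node_mask = sigmoid(node_logits)              # shape (N,)
    edge_mask = sigmoid(edge_logits)              # shape (N, N)

    masked_feats = features * node_mask[:, None]  # down-weight masked nodes
    masked_adj = adjacency * edge_mask            # down-weight masked edges

    # Degree-normalized neighbor aggregation over the masked graph.
    degree = masked_adj.sum(axis=1, keepdims=True)
    messages = masked_adj @ masked_feats / np.maximum(degree, 1e-8)

    # Mean readout over nodes gives a single graph-level representation.
    return (masked_feats + messages).mean(axis=0)


rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))                # 4 nodes, 3-dim features
adjacency = np.array(
    [[0, 1, 1, 0],
     [1, 0, 0, 1],
     [1, 0, 0, 1],
     [0, 1, 1, 0]],
    dtype=float,
)

# Zero logits correspond to a mask value of 0.5 on every node and edge.
graph_repr = masked_graph_readout(
    features, adjacency, np.zeros(4), np.zeros((4, 4))
)
```

Driving a logit strongly negative effectively removes the corresponding node or edge from the readout, which is the sense in which masks of this kind can separate retained ("causal") structure from discarded ("spurious") structure.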

Published

2025-04-11

How to Cite

Meng, L., Li, X., Yan, X., Ma, H., Qi, Z., Wu, W., & Meng, X. (2025). Causal Inference over Visual-Semantic-Aligned Graph for Image Classification. Proceedings of the AAAI Conference on Artificial Intelligence, 39(18), 19449–19457. https://doi.org/10.1609/aaai.v39i18.34141

Section

AAAI Technical Track on Machine Learning IV