Multi-View Differential Mixing and Graph-Guided Structural Region Selection for Cross-Modal Alignment

Authors

  • Linlin Ji, School of Information Science and Engineering, Shandong Normal University
  • Li Liu, School of Information Science and Engineering, Shandong Normal University; Shandong Province Key Laboratory of Independent and Reliable Computing Technology and Equipment

DOI:

https://doi.org/10.1609/aaai.v40i26.39376

Abstract

Cross-modal alignment is a promising yet challenging task in multimodal learning. Existing methods typically assess it by measuring cross-modal semantic similarity from global and local perspectives. However, they often neglect the potential interdependence between these two perspectives. Specifically, global matching methods over-compress local features, while local matching methods rarely consider the inherent spatial topology of image patches. To address these limitations, we propose MG-Net, a unified framework with two collaborative modules: the Multi-View Differential Mixer (MDM) and the Graph-Guided Structural Region Selector (GSRS). The MDM is designed to capture discriminative global representations. It generates a series of views by decomposing feature vectors through multi-order differential operations and adaptively fuses them via a lightweight Mixture-of-Experts (MoE) network. Meanwhile, the GSRS organizes image patches as a spatial graph and employs text-guided contextual reasoning to select spatially coherent and semantically complete structural regions. Extensive experiments on the Flickr30K and MS-COCO benchmarks demonstrate that the proposed MG-Net outperforms state-of-the-art methods in most cases.
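
The two ideas named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation, and the abstract does not specify these details: it assumes that "multi-order differential operations" can be approximated by finite differences along the feature dimension, replaces the MoE network with a single softmax gate, and stands in for text-guided contextual reasoning with one step of neighbor averaging over a 4-connected patch grid. All function names and parameters (differential_views, mix_views, grid_adjacency, select_region, gate_w, k) are hypothetical.

```python
import numpy as np

def differential_views(x, max_order=2):
    """Stack x with its 1st..max_order finite differences (assumed stand-in)."""
    views = [x]
    for order in range(1, max_order + 1):
        d = np.diff(x, n=order)               # length shrinks by `order`
        views.append(np.pad(d, (0, order)))   # zero-pad the tail back to len(x)
    return np.stack(views)                    # shape: (max_order + 1, dim)

def softmax(z):
    e = np.exp(z - z.max())                   # subtract max for numerical stability
    return e / e.sum()

def mix_views(x, gate_w, max_order=2):
    """Fuse the views with a softmax gate (a simplified substitute for an MoE)."""
    views = differential_views(x, max_order)
    gate = softmax(gate_w @ x)                # one weight per view
    return gate @ views                       # weighted sum, shape: (dim,)

def grid_adjacency(h, w):
    """4-connected adjacency matrix for an h x w grid of image patches."""
    adj = np.zeros((h * w, h * w))
    for r in range(h):
        for c in range(w):
            i = r * w + c
            if c + 1 < w:                     # right neighbor
                adj[i, i + 1] = adj[i + 1, i] = 1
            if r + 1 < h:                     # bottom neighbor
                adj[i, i + w] = adj[i + w, i] = 1
    return adj

def select_region(patches, text, adj, k):
    """Pick k patches by text similarity smoothed over graph neighbors."""
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    t = text / np.linalg.norm(text)
    scores = p @ t                            # cosine similarity per patch
    deg = np.maximum(adj.sum(axis=1), 1)      # guard against isolated nodes
    smoothed = 0.5 * scores + 0.5 * (adj @ scores) / deg
    return np.argsort(-smoothed)[:k]          # indices of the top-k patches

# Usage with toy data (values are illustrative only).
x = np.array([1.0, 4.0, 9.0, 16.0])
fused = mix_views(x, gate_w=np.zeros((3, 4)))  # zero gate gives uniform mixing
patches = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
region = select_region(patches, np.array([1.0, 0.0]), grid_adjacency(2, 2), k=2)
```

The neighbor averaging favors patches whose neighbors also match the text, which is one simple way to encourage spatially coherent selections; the actual GSRS module may work quite differently.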

Published

2026-03-14

How to Cite

Ji, L., & Liu, L. (2026). Multi-View Differential Mixing and Graph-Guided Structural Region Selection for Cross-Modal Alignment. Proceedings of the AAAI Conference on Artificial Intelligence, 40(26), 22200–22208. https://doi.org/10.1609/aaai.v40i26.39376

Section

AAAI Technical Track on Machine Learning III