Multi-View Differential Mixing and Graph-Guided Structural Region Selection for Cross-Modal Alignment

Linlin Ji; Li Liu

doi:10.1609/aaai.v40i26.39376

Authors

Linlin Ji, School of Information Science and Engineering, Shandong Normal University
Li Liu, School of Information Science and Engineering, Shandong Normal University; Shandong Province Key Laboratory of Independent and Reliable Computing Technology and Equipment

Abstract

Cross-modal alignment is a promising yet challenging task in multimodal learning. Existing methods typically assess it by measuring the cross-modal semantic similarity from both global and local perspectives. However, these methods often neglect the potential interdependence between the two perspectives. Specifically, global matching methods suffer from the over-compression of local features, while local matching methods rarely consider the inherent spatial topology of image patches. To address these limitations, we propose MG-Net, a unified framework with two collaborative modules: Multi-View Differential Mixer (MDM) and Graph-Guided Structural Region Selector (GSRS). The MDM is designed to capture discriminative global representations. It generates a series of views by decomposing feature vectors through multi-order differential operations, and adaptively fuses them via a lightweight Mixture-of-Experts (MoE) network. Meanwhile, the GSRS organizes image patches as a spatial graph and employs text-guided contextual reasoning to select spatially coherent and semantically complete structural regions. Extensive experiments on the Flickr30K and MS-COCO benchmarks demonstrate that the proposed MG-Net outperforms state-of-the-art methods in most cases.
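
The abstract gives no implementation details for the MDM, so the following is only an illustrative sketch of the general idea it describes: build several "views" of a feature vector with finite differences of increasing order, then fuse them with softmax gate weights. Everything here is an assumption made for illustration and is not taken from the paper: the function names (differential_views, mix_views), the use of discrete finite differences, the zero-padding, the mean-pooled gate input, and the random gate matrix W.

```python
import numpy as np

def differential_views(x, max_order=2):
    """Return the stack [x, first difference, second difference, ...].

    Assumption: "multi-order differential operations" is approximated here
    by discrete finite differences along the feature dimension.
    """
    views = [x]
    for k in range(1, max_order + 1):
        d = np.diff(x, n=k)              # k-th order difference, length len(x) - k
        views.append(np.pad(d, (0, k)))  # zero-pad so every view has the same length
    return np.stack(views)               # shape: (max_order + 1, dim)

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def mix_views(views, gate_weights):
    """Fuse views with a softmax gate (a stand-in for a lightweight MoE router)."""
    pooled = views.mean(axis=0)              # (dim,) summary used as gate input
    gates = softmax(gate_weights @ pooled)   # (n_views,), non-negative, sums to 1
    return gates @ views, gates              # fused vector (dim,), gate values

rng = np.random.default_rng(0)
x = rng.standard_normal(8)                   # toy global feature vector
views = differential_views(x, max_order=2)
W = rng.standard_normal((views.shape[0], x.shape[0]))  # hypothetical gate parameters
mixed, gates = mix_views(views, W)
```

In a trained model the gate parameters would be learned jointly with the encoders; here they are random, so the sketch only shows the data flow from one feature vector to several views to one fused representation.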
