Multi-View Differential Mixing and Graph-Guided Structural Region Selection for Cross-Modal Alignment
DOI:
https://doi.org/10.1609/aaai.v40i26.39376
Abstract
Cross-modal alignment is a promising yet challenging task in multimodal learning. Existing methods typically assess it by measuring the cross-modal semantic similarity from both global and local perspectives. However, these methods often neglect their potential interdependence. Specifically, global matching methods suffer from the over-compression of local features, while local matching methods rarely consider the inherent spatial topology of image patches. To address these limitations, we propose MG-Net, a unified framework with two collaborative modules: Multi-View Differential Mixer (MDM) and Graph-Guided Structural Region Selector (GSRS). The MDM is designed to capture discriminative global representations. It generates a series of views by decomposing feature vectors through multi-order differential operations, and adaptively fuses them via a lightweight Mixture-of-Experts (MoE) network. Meanwhile, the GSRS organizes image patches as a spatial graph and employs text-guided contextual reasoning to select spatially coherent and semantically complete structural regions. Extensive experiments on the Flickr30K and MS-COCO benchmarks demonstrate that the proposed MG-Net outperforms state-of-the-art methods in most cases.
Published
2026-03-14
How to Cite
Ji, L., & Liu, L. (2026). Multi-View Differential Mixing and Graph-Guided Structural Region Selection for Cross-Modal Alignment. Proceedings of the AAAI Conference on Artificial Intelligence, 40(26), 22200–22208. https://doi.org/10.1609/aaai.v40i26.39376
Section
AAAI Technical Track on Machine Learning III
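Illustrative sketch
The abstract describes the MDM as building several views of a feature vector through multi-order differential operations and fusing them with a lightweight gate. The toy example below is only one possible reading of that sentence, not the authors' implementation. It assumes that "differential operation" means a forward finite difference along the feature dimension, that each view is zero-padded back to the input length, and that the gate is a softmax over linear scores. The names `finite_difference`, `mix_views`, and `gate_weights` are hypothetical, and the views themselves stand in for the experts of a real MoE network.

```python
import math


def finite_difference(vec, order):
    """Apply the forward difference `order` times, zero-padding to the input length.

    Assumption: this stands in for the "multi-order differential operations"
    named in the abstract; the paper's actual operator may differ.
    """
    out = list(vec)
    for _ in range(order):
        out = [out[i + 1] - out[i] for i in range(len(out) - 1)]
    return out + [0.0] * (len(vec) - len(out))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_views(vec, max_order, gate_weights):
    """Build views of order 0..max_order and fuse them with a softmax gate.

    gate_weights is a hypothetical list with one weight vector per view;
    the gate score of a view is the dot product of its weights with `vec`.
    """
    views = [finite_difference(vec, k) for k in range(max_order + 1)]
    scores = [sum(w * x for w, x in zip(ws, vec)) for ws in gate_weights]
    gates = softmax(scores)
    fused = [
        sum(g * view[i] for g, view in zip(gates, views))
        for i in range(len(vec))
    ]
    return views, gates, fused


if __name__ == "__main__":
    feature = [1.0, 4.0, 9.0, 16.0]
    # All-zero gate weights give every view the same gate value.
    weights = [[0.0] * len(feature) for _ in range(3)]
    views, gates, fused = mix_views(feature, max_order=2, gate_weights=weights)
    print(views)
    print(gates)
    print(fused)
```

With all-zero gate weights the fused vector is simply the mean of the views; in a trained model the gate would instead depend on the input, which is what makes the fusion adaptive.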