Frequency-Aligned Cross-Modal Learning with Top-K Wavelet Fusion and Dynamic Expert Routing for Enhanced Retinal Disease Diagnosis

Authors

  • Yuxin Lin School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen; Shenzhen Key Laboratory of Visual Object Detection and Recognition
  • Haoran Li School of Information Technology, University of Wollongong
  • Haoyu Cao School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen
  • Yongting Hu School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen
  • Qihao Xu School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen
  • Chengliang Liu Laboratory for Artificial Intelligence in Design, The Hong Kong Polytechnic University
  • Xiaoling Luo College of Computer Science and Software Engineering, Shenzhen University
  • Zhihao Wu School of Artificial Intelligence, Shenzhen University
  • Yong Xu School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen; Shenzhen Key Laboratory of Visual Object Detection and Recognition
  • Wei Wang School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen

DOI:

https://doi.org/10.1609/aaai.v40i9.37635

Abstract

Multimodal fusion of color fundus photography (CFP) and optical coherence tomography (OCT) B-scan images has demonstrated superior diagnostic potential for retinal diseases compared to single-modality approaches. However, existing fusion paradigms, whether based on naive concatenation or attention mechanisms, treat cross-modal interactions indiscriminately and lack adaptive modulation of modality-specific contributions under varying clinical scenarios. We propose an adaptive fusion framework that dynamically routes and refines multimodal signals to enhance disease recognition. The framework comprises two key components: 1) Dynamic Cross-Modal Expert Routing (CMER), which selectively activates convolutional neural network (CNN) experts from one modality based on contextual guidance from the other, ensuring that only the most relevant feature extractors contribute to fusion; and 2) Top-K Expert-Guided Wavelet Fusion (TEWF), which applies a discrete wavelet transform (DWT) to decompose the selected features into low- and high-frequency subbands. Cross-modal attention is then applied specifically to the high-frequency components, where lesion-specific microstructures reside, enabling frequency-aware fusion. Finally, an inverse DWT (IDWT) reconstructs the fused representation, weighted by CMER-derived importance scores to amplify informative modality cues while suppressing redundancy. Experimental validation on two multimodal retinal datasets demonstrates that our method achieves state-of-the-art performance, outperforming existing fusion strategies by significant margins in disease classification accuracy and robustness.
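The frequency-aware fusion step that TEWF describes can be sketched in a minimal, hypothetical form: a single-level Haar DWT splits each modality's feature map into one low-frequency and three high-frequency subbands, only the high-frequency subbands are blended across modalities, and an inverse DWT reconstructs the fused map. The scalar `gate` below stands in for the paper's cross-modal attention and CMER-derived importance scores, and all names (`haar_dwt2`, `haar_idwt2`, `tewf_fuse`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT on an even-sized map -> (LL, LH, HL, HH)."""
    a = (x[0::2, :] + x[1::2, :]) / 2.0   # row averages (low-pass)
    d = (x[0::2, :] - x[1::2, :]) / 2.0   # row differences (high-pass)
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0  # low-low: coarse structure
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0  # high-frequency subbands:
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0  # edges and fine detail, where
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0  # lesion microstructure would live
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2 (perfect reconstruction)."""
    a = np.empty((ll.shape[0], ll.shape[1] * 2))
    d = np.empty_like(a)
    a[:, 0::2], a[:, 1::2] = ll + lh, ll - lh
    d[:, 0::2], d[:, 1::2] = hl + hh, hl - hh
    x = np.empty((a.shape[0] * 2, a.shape[1]))
    x[0::2, :], x[1::2, :] = a + d, a - d
    return x

def tewf_fuse(feat_cfp, feat_oct, gate):
    """Toy frequency-aware fusion: keep CFP low frequencies, blend only the
    high-frequency subbands of the two modalities, then reconstruct."""
    ll_c, lh_c, hl_c, hh_c = haar_dwt2(feat_cfp)
    _,    lh_o, hl_o, hh_o = haar_dwt2(feat_oct)
    blend = lambda hc, ho: gate * hc + (1.0 - gate) * ho
    return haar_idwt2(ll_c, blend(lh_c, lh_o),
                      blend(hl_c, hl_o), blend(hh_c, hh_o))
```

With `gate = 1.0` the OCT high frequencies are ignored and the CFP map is reconstructed exactly, which makes the perfect-reconstruction property easy to verify; in the paper this scalar is replaced by learned cross-modal attention over the high-frequency components.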

Published

2026-03-14

How to Cite

Lin, Y., Li, H., Cao, H., Hu, Y., Xu, Q., Liu, C., … Wang, W. (2026). Frequency-Aligned Cross-Modal Learning with Top-K Wavelet Fusion and Dynamic Expert Routing for Enhanced Retinal Disease Diagnosis. Proceedings of the AAAI Conference on Artificial Intelligence, 40(9), 7006–7014. https://doi.org/10.1609/aaai.v40i9.37635

Section

AAAI Technical Track on Computer Vision VI