Frequency-Aligned Cross-Modal Learning with Top-K Wavelet Fusion and Dynamic Expert Routing for Enhanced Retinal Disease Diagnosis

Authors

  • Yuxin Lin School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen; Shenzhen Key Laboratory of Visual Object Detection and Recognition
  • Haoran Li School of Information Technology, University of Wollongong
  • Haoyu Cao School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen
  • Yongting Hu School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen
  • Qihao Xu School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen
  • Chengliang Liu Laboratory for Artificial Intelligence in Design, The Hong Kong Polytechnic University
  • Xiaoling Luo College of Computer Science and Software Engineering, Shenzhen University
  • Zhihao Wu School of Artificial Intelligence, Shenzhen University
  • Yong Xu School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen; Shenzhen Key Laboratory of Visual Object Detection and Recognition
  • Wei Wang School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen

DOI:

https://doi.org/10.1609/aaai.v40i9.37635

Abstract

Multimodal fusion of color fundus photography (CFP) and optical coherence tomography (OCT) B-scan images has demonstrated superior diagnostic potential for retinal diseases compared to single-modality approaches. However, existing fusion paradigms, whether based on naive concatenation or attention mechanisms, treat cross-modal interactions indiscriminately and lack adaptive modulation of modality-specific contributions under varying clinical scenarios. We propose an adaptive fusion framework that dynamically routes and refines multimodal signals to enhance disease recognition. The framework comprises two key components: 1) Dynamic Cross-Modal Expert Routing (CMER), which selectively activates convolutional neural network (CNN) experts from one modality based on contextual guidance from the other, ensuring that only the most relevant feature extractors contribute to fusion; and 2) Top-K Expert-Guided Wavelet Fusion (TEWF), which applies a discrete wavelet transform (DWT) to decompose the selected features into low- and high-frequency subbands. Cross-modal attention is then applied specifically to the high-frequency components, where lesion-specific microstructures reside, enabling frequency-aware fusion. Finally, an inverse DWT (IDWT) reconstructs the fused representation, weighted by CMER-derived importance scores to amplify informative modality cues while suppressing redundancy. Experimental validation on two multimodal retinal datasets demonstrates that our method achieves state-of-the-art performance, outperforming existing fusion strategies by significant margins in disease classification accuracy and robustness.
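The frequency-aware fusion step that TEWF describes can be sketched in a minimal, hypothetical form: a single-level Haar DWT splits each modality's feature map into one low-frequency and three high-frequency subbands, only the high-frequency subbands are blended across modalities, and an inverse DWT reconstructs the fused map. The scalar `gate` below stands in for the paper's cross-modal attention and CMER-derived importance scores, and all names (`haar_dwt2`, `haar_idwt2`, `tewf_fuse`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT on an even-sized map -> (LL, LH, HL, HH)."""
    a = (x[0::2, :] + x[1::2, :]) / 2.0   # row averages (low-pass)
    d = (x[0::2, :] - x[1::2, :]) / 2.0   # row differences (high-pass)
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0  # low-low: coarse structure
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0  # high-frequency subbands:
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0  # edges and fine detail, where
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0  # lesion microstructure would live
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2 (perfect reconstruction)."""
    a = np.empty((ll.shape[0], ll.shape[1] * 2))
    d = np.empty_like(a)
    a[:, 0::2], a[:, 1::2] = ll + lh, ll - lh
    d[:, 0::2], d[:, 1::2] = hl + hh, hl - hh
    x = np.empty((a.shape[0] * 2, a.shape[1]))
    x[0::2, :], x[1::2, :] = a + d, a - d
    return x

def tewf_fuse(feat_cfp, feat_oct, gate):
    """Toy frequency-aware fusion: keep CFP low frequencies, blend only the
    high-frequency subbands of the two modalities, then reconstruct."""
    ll_c, lh_c, hl_c, hh_c = haar_dwt2(feat_cfp)
    _,    lh_o, hl_o, hh_o = haar_dwt2(feat_oct)
    blend = lambda hc, ho: gate * hc + (1.0 - gate) * ho
    return haar_idwt2(ll_c, blend(lh_c, lh_o),
                      blend(hl_c, hl_o), blend(hh_c, hh_o))
```

With `gate = 1.0` the OCT high frequencies are ignored and the CFP map is reconstructed exactly, which makes the perfect-reconstruction property easy to verify; in the paper this scalar is replaced by learned cross-modal attention over the high-frequency components.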

Published

2026-03-14

How to Cite

Lin, Y., Li, H., Cao, H., Hu, Y., Xu, Q., Liu, C., … Wang, W. (2026). Frequency-Aligned Cross-Modal Learning with Top-K Wavelet Fusion and Dynamic Expert Routing for Enhanced Retinal Disease Diagnosis. Proceedings of the AAAI Conference on Artificial Intelligence, 40(9), 7006–7014. https://doi.org/10.1609/aaai.v40i9.37635

Section

AAAI Technical Track on Computer Vision VI