To Align or Not to Align: Strategic Multimodal Representation Alignment for Optimal Performance

Authors

  • Wanlong Fang, Artificial Intelligence-X (AI-X) @ NTU, Interdisciplinary Graduate Programme, Nanyang Technological University; College of Computing and Data Science, Nanyang Technological University
  • Tianle Zhang, College of Computing and Data Science, Nanyang Technological University
  • Alvin Chan, College of Computing and Data Science, Nanyang Technological University; Lee Kong Chian School of Medicine, Nanyang Technological University

DOI:

https://doi.org/10.1609/aaai.v40i25.39248

Abstract

Multimodal learning often relies on aligning representations across modalities to enable effective information integration, an approach traditionally assumed to be universally beneficial. However, prior research has been primarily observational: it examines naturally occurring alignment in multimodal data and its correlation with model performance, without systematically studying the direct effects of explicitly enforced alignment between representations of different modalities. In this work, we investigate how explicit alignment influences both model performance and representation alignment under different modality-specific information structures. Specifically, we introduce a controllable contrastive learning module that enables precise manipulation of alignment strength during training, allowing us to explore when explicit alignment improves or hinders performance. Our results on synthetic and real datasets with different data characteristics show that the impact of explicit alignment on unimodal model performance depends on the data: the optimal level of alignment depends on the amount of redundancy between modalities. For mixed information distributions, we find an optimal alignment strength that balances modality-specific signals against shared redundancy. This work offers practitioners guidance on when and how to enforce alignment for optimal unimodal encoder performance.
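
The abstract does not specify how the controllable contrastive module is built. As a rough illustration only, one common way to make alignment strength tunable is to add a contrastive term to the task loss with a scalar weight. The sketch below assumes a symmetric InfoNCE loss over paired embeddings; the names `info_nce`, `total_loss`, `alignment_strength`, and `temperature` are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(xs, ys, temperature=0.1):
    """Symmetric InfoNCE loss; xs[i] and ys[i] are treated as a positive pair.

    Assumption: this stands in for the paper's alignment objective,
    which the abstract does not define.
    """
    n = len(xs)
    logits = [[cosine(x, y) / temperature for y in ys] for x in xs]

    def cross_entropy(row, target):
        # Numerically stable log-sum-exp minus the positive-pair logit.
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        return log_z - row[target]

    # Modality X -> Y direction: each row of the similarity matrix.
    x_to_y = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Modality Y -> X direction: each column of the similarity matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    y_to_x = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return 0.5 * (x_to_y + y_to_x)

def total_loss(task_loss, xs, ys, alignment_strength):
    """Task loss plus a weighted alignment term (hypothetical formulation)."""
    return task_loss + alignment_strength * info_nce(xs, ys)
```

Under this assumed formulation, setting `alignment_strength` to 0 trains on the task loss alone, while larger values pull paired embeddings from the two modalities closer together. Sweeping that weight is one way to probe how the best setting varies with the redundancy between modalities.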

Published

2026-03-14

How to Cite

Fang, W., Zhang, T., & Chan, A. (2026). To Align or Not to Align: Strategic Multimodal Representation Alignment for Optimal Performance. Proceedings of the AAAI Conference on Artificial Intelligence, 40(25), 21056–21064. https://doi.org/10.1609/aaai.v40i25.39248

Section

AAAI Technical Track on Machine Learning II