Multimodal Gaussian Mixture Variational Autoencoder with Consistency Regularizations

Authors

  • Yarui Chen, Tianjin University of Science and Technology
  • Lehan Hong, Tianjin University of Science and Technology
  • Jianlin Shao, Tianjin University of Science and Technology
  • Jianning Yang, Xi'an University of Posts & Telecommunications
  • Tingting Zhao, Tianjin University of Science and Technology
  • Yun Liao, Tianjin University of Science and Technology
  • Yancui Shi, Tianjin University of Science and Technology

DOI:

https://doi.org/10.1609/aaai.v40i4.37302

Abstract

Variational autoencoder (VAE)-based frameworks are naturally suited to modeling the shared and private information inherent in multimodal data. However, current models focus on improving the quality of shared representations from the reconstruction perspective and lack explicit mechanisms for modeling their underlying semantic structure. In this paper, we propose the multimodal Gaussian mixture variational autoencoder with consistency regularizations, which places a Gaussian mixture prior over the shared latent space to enhance its semantic structure and encourage cluster-aware latent representations. To address cross-modal inconsistency under missing-modality conditions, we propose a cluster-guided regularization strategy that enforces cross-modal consistency using pseudo-category labels obtained from unsupervised clustering. Additionally, we design a self-supervised contrastive regularization strategy to align semantically similar representations across modalities. Extensive experiments on the MNIST-SVHN and MNIST-CDCB datasets demonstrate that our method significantly outperforms prior state-of-the-art models in generation, classification, and retrieval tasks.
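
The abstract names three ingredients without giving their exact form: a Gaussian mixture prior over the shared latent space, pseudo-category labels from clustering, and a contrastive alignment term. The following is a minimal NumPy sketch of what such components can look like, not the authors' implementation. Everything specific here is an assumption made for illustration: the diagonal covariances, the use of a symmetric InfoNCE loss as the contrastive term, the temperature value, and all function names (`gmm_log_joint`, `gmm_log_prior`, `pseudo_labels`, `info_nce`).

```python
import numpy as np

def logsumexp(a, axis):
    # Numerically stable log(sum(exp(a))) along one axis.
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def gmm_log_joint(z, log_pi, mu, log_var):
    # log pi_k + log N(z | mu_k, diag(exp(log_var_k))) for every sample and component.
    # Shapes: z (N, D); log_pi (K,); mu and log_var (K, D). Diagonal covariance is assumed.
    diff = z[:, None, :] - mu[None, :, :]
    log_norm = -0.5 * (np.log(2 * np.pi) + log_var[None] + diff**2 / np.exp(log_var)[None])
    return log_pi[None, :] + log_norm.sum(axis=-1)  # (N, K)

def gmm_log_prior(z, log_pi, mu, log_var):
    # Log-density of the mixture prior: log sum_k pi_k N(z | mu_k, Sigma_k).
    return logsumexp(gmm_log_joint(z, log_pi, mu, log_var), axis=1)

def pseudo_labels(z, log_pi, mu, log_var):
    # Hard cluster assignment: the component with the highest posterior responsibility.
    return np.argmax(gmm_log_joint(z, log_pi, mu, log_var), axis=1)

def info_nce(z_a, z_b, temperature=0.1):
    # Symmetric InfoNCE (assumed form): row i of z_a and row i of z_b are a positive pair,
    # all other rows in the batch act as negatives.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    idx = np.arange(len(z_a))
    loss_ab = -(logits[idx, idx] - logsumexp(logits, axis=1))
    loss_ba = -(logits[idx, idx] - logsumexp(logits, axis=0))
    return 0.5 * (loss_ab.mean() + loss_ba.mean())
```

In a sketch like this, the shared embeddings of two modalities for the same sample would be passed as matching rows of `z_a` and `z_b`, and agreement between the `pseudo_labels` outputs of the two modalities is one way a cluster-guided consistency term could be measured.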

Published

2026-03-14

How to Cite

Chen, Y., Hong, L., Shao, J., Yang, J., Zhao, T., Liao, Y., & Shi, Y. (2026). Multimodal Gaussian Mixture Variational Autoencoder with Consistency Regularizations. Proceedings of the AAAI Conference on Artificial Intelligence, 40(4), 3092–3100. https://doi.org/10.1609/aaai.v40i4.37302

Section

AAAI Technical Track on Computer Vision I