PLUM-Net: Prototype-Induced Label Structuring for Disentangled Multimodal Representation Network
DOI:
https://doi.org/10.1609/aaai.v40i22.38928
Abstract
Existing multimodal representation learning approaches often rely on simple feature concatenation or unified transformations, which fail to progressively disentangle and exploit the common and private information carried by different modalities. They also typically lack adaptive modeling tailored to specific task requirements. To address these limitations, we propose PLUM-Net, a Prototype-Induced Label Structuring for Disentangled Multimodal Representation Network. It first employs a multilevel semantic alignment module to synchronize global and local semantics across audio, visual and textual streams. On this aligned foundation, a prototype-based single-modal label generation module derives modality-specific hard and soft labels that steer the network toward a cleaner split between shared and private cues. Guided by these labels, the task-conditioned feature bifurcator module channels information through the most beneficial common or private pathway for the given task, after which a private refinement module polishes and fuses each modality’s idiosyncratic signals. Extensive experiments show that PLUM-Net delivers strong performance on datasets such as CMU-MOSI, CMU-MOSEI and UR-FUNNY, achieving an ACC-2 of 90.3% on CMU-MOSI, a 2%–4% improvement over previous state-of-the-art models.
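The abstract does not spell out how the prototype-based single-modal label generation module computes its labels. As a rough illustration only, the sketch below assumes one common prototype scheme: each class prototype is the mean of that class's unimodal features, the soft label is a softmax over negative squared distances to the prototypes, and the hard label is the most likely class. The function names, the distance measure and the `temperature` parameter are hypothetical and are not taken from the paper.

```python
import math


def class_prototypes(features, labels):
    """Average the feature vectors of each class (assumed prototype definition)."""
    sums, counts = {}, {}
    for vec, lab in zip(features, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def soft_and_hard_label(vec, prototypes, temperature=1.0):
    """Return (soft, hard): a softmax over negative squared distances and its argmax."""
    classes = sorted(prototypes)
    logits = [
        -sum((a - b) ** 2 for a, b in zip(vec, prototypes[c])) / temperature
        for c in classes
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    soft = {c: e / total for c, e in zip(classes, exps)}
    hard = max(soft, key=soft.get)
    return soft, hard


# Toy unimodal (e.g. audio) features with their task labels; values are made up.
audio_feats = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [0.8, 1.0]]
task_labels = [0, 0, 1, 1]
protos = class_prototypes(audio_feats, task_labels)
soft, hard = soft_and_hard_label([0.1, 0.1], protos)
```

In a setup like this, the hard label could act as a per-modality classification target while the soft label keeps information about how close a sample sits to the other classes. Whether PLUM-Net uses its labels this way is not stated in the abstract.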
Published
2026-03-14
How to Cite
Wang, K., Zhao, H., Wei, Y., Zha, X., Ye, G., Zhu, C., … Zhang, Z. (2026). PLUM-Net: Prototype-Induced Label Structuring for Disentangled Multimodal Representation Network. Proceedings of the AAAI Conference on Artificial Intelligence, 40(22), 18611–18619. https://doi.org/10.1609/aaai.v40i22.38928
Section
AAAI Technical Track on Intelligent Robotics