PLUM-Net: Prototype-Induced Label Structuring for Disentangled Multimodal Representation Network

Kehan Wang; Huan Zhao; Yong Wei; Xupeng Zha; Guanghui Ye; Cheng Zhu; Yiming Liu; Zixing Zhang

doi:10.1609/aaai.v40i22.38928

Authors

Kehan Wang, Hunan University
Huan Zhao, Hunan University
Yong Wei, Hunan University
Xupeng Zha, Hunan University
Guanghui Ye, Hunan University
Cheng Zhu, Hunan University
Yiming Liu, Hunan University
Zixing Zhang, Hunan University

Abstract

Existing multimodal representation learning approaches often rely on simple feature concatenation or unified transformations, which fail to progressively disentangle and exploit the common and private information in each modality. They also typically lack adaptive modeling tailored to specific task requirements. To address these limitations, we propose the Prototype-Induced Label Structuring for Disentangled Multimodal Representation Network (PLUM-Net). It first employs a multilevel semantic alignment module to synchronize global and local semantics across audio, visual, and textual streams. On this aligned foundation, a prototype-based single-modal label generation module derives modality-specific hard and soft labels that steer the network toward a cleaner split between shared and private cues. Guided by these labels, the task-conditioned feature bifurcator module routes information through whichever pathway, common or private, is most beneficial for the given task, after which a private refinement module refines and fuses each modality’s idiosyncratic signals. Extensive experiments show that PLUM-Net delivers strong performance on datasets such as CMU-MOSI, CMU-MOSEI, and UR-FUNNY, achieving an ACC-2 of 90.3% on CMU-MOSI, a 2%–4% improvement over previous SOTA models.
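
The abstract does not say how prototypes are built or how the hard and soft labels are derived from them. As a rough illustration of the general idea only, the sketch below assumes each prototype is the mean feature vector of a class within one modality, the hard label is the nearest prototype, and the soft label is a softmax over negative Euclidean distances. The function names, the `temperature` parameter, and the toy data are all hypothetical and are not taken from the paper.

```python
import math

def class_prototypes(features, labels):
    """Assumed prototype definition: the mean feature vector of each class."""
    sums, counts = {}, {}
    for vec, lab in zip(features, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def prototype_labels(vec, prototypes, temperature=1.0):
    """Return (hard_label, soft_label) for one single-modality feature vector.

    Assumption: soft labels are a softmax over negative Euclidean
    distances to the prototypes; the hard label is the closest prototype.
    """
    classes = sorted(prototypes)
    logits = [-math.dist(vec, prototypes[c]) / temperature for c in classes]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    soft = {c: e / total for c, e in zip(classes, exps)}
    hard = max(soft, key=soft.get)
    return hard, soft

# Toy audio-modality features with two sentiment classes (made-up numbers).
audio_feats = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8]]
audio_labels = [0, 0, 1, 1]
protos = class_prototypes(audio_feats, audio_labels)
hard, soft = prototype_labels([0.1, 0.1], protos)
```

In this toy setup the query vector lies close to the class-0 prototype, so the hard label is class 0 and the soft label places most of its mass there. Lowering `temperature` would sharpen the soft label toward the hard one.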
