Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis
DOI:
https://doi.org/10.1609/aaai.v35i12.17289
Keywords:
Multimodal Learning, Language Grounding & Multi-modal NLP, Text Classification & Sentiment Analysis
Abstract
Representation learning is a significant and challenging task in multimodal learning. Effective modality representations should contain two aspects of characteristics: consistency and difference. Due to the unified multimodal annotation, existing methods are restricted in capturing differentiated information. However, obtaining additional unimodal annotations is time-consuming and labor-intensive. In this paper, we design a label generation module based on a self-supervised learning strategy to acquire independent unimodal supervisions. We then jointly train the multimodal and unimodal tasks to learn consistency and difference, respectively. Moreover, during the training stage, we design a weight-adjustment strategy to balance the learning progress among the different subtasks, guiding them to focus on samples with larger differences between modality supervisions. Finally, we conduct extensive experiments on three public multimodal baseline datasets. The experimental results validate the reliability and stability of the auto-generated unimodal supervisions. On the MOSI and MOSEI datasets, our method surpasses the current state-of-the-art methods. On the SIMS dataset, our method achieves performance comparable to that obtained with human-annotated unimodal labels. The full code is available at https://github.com/thuiar/Self-MM.
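To make the abstract's weighted multi-task idea concrete, here is a minimal sketch of a joint objective in which each unimodal subtask is weighted per sample by the gap between its auto-generated label and the human-annotated multimodal label. This is not the authors' implementation (see the linked Self-MM repository for the actual code); the function names, tensor shapes, and the tanh-based weighting are illustrative assumptions.

```python
# Minimal sketch of a weighted multi-task loss for multimodal sentiment regression.
# Assumptions: label_m is human-annotated; label_t/a/v are auto-generated unimodal
# labels; all predictions and labels are 1-D tensors of shape (batch_size,).
import torch
import torch.nn.functional as F


def multitask_loss(pred_m, pred_t, pred_a, pred_v,
                   label_m, label_t, label_a, label_v):
    # Multimodal task: plain L1 loss against the human-annotated label.
    loss = F.l1_loss(pred_m, label_m)
    # Unimodal subtasks: weight each sample by how far its auto-generated label
    # deviates from the multimodal label, so the subtasks focus on samples with
    # larger differences between modality supervisions.
    for pred_u, label_u in ((pred_t, label_t), (pred_a, label_a), (pred_v, label_v)):
        w = torch.tanh(torch.abs(label_u - label_m))
        loss = loss + (w * torch.abs(pred_u - label_u)).mean()
    return loss


if __name__ == "__main__":
    # Toy usage with random predictions and labels for a batch of 4 samples.
    batch = 4
    preds = [torch.randn(batch) for _ in range(4)]
    labels = [torch.randn(batch) for _ in range(4)]
    print(multitask_loss(*preds, *labels).item())
```

The per-sample weight vanishes when a unimodal label agrees with the multimodal one, so only samples carrying modality-specific information drive the unimodal subtasks; this mirrors the weight-adjustment intent described in the abstract, though the exact weighting function here is an assumption.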
Published
2021-05-18
How to Cite
Yu, W., Xu, H., Yuan, Z., & Wu, J. (2021). Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis. Proceedings of the AAAI Conference on Artificial Intelligence, 35(12), 10790-10797. https://doi.org/10.1609/aaai.v35i12.17289
Issue
Section
AAAI Technical Track on Machine Learning V