Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis

Authors

  • Wenmeng Yu State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  • Hua Xu State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  • Ziqi Yuan State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  • Jiele Wu State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

DOI:

https://doi.org/10.1609/aaai.v35i12.17289

Keywords:

Multimodal Learning, Language Grounding & Multi-modal NLP, Text Classification & Sentiment Analysis

Abstract

Representation learning is a significant and challenging task in multimodal learning. Effective modality representations should contain two aspects: consistency and difference. Due to the unified multimodal annotation, existing methods are restricted in capturing differentiated information. However, additional unimodal annotations are time- and labor-intensive. In this paper, we design a label generation module based on a self-supervised learning strategy to acquire independent unimodal supervision. We then jointly train the multimodal and unimodal tasks to learn consistency and difference, respectively. Moreover, during the training stage, we design a weight-adjustment strategy to balance the learning progress among the different subtasks: it guides the subtasks to focus on samples with larger differences between modality supervisions. Finally, we conduct extensive experiments on three public multimodal benchmark datasets. The experimental results validate the reliability and stability of the auto-generated unimodal supervision. On the MOSI and MOSEI datasets, our method surpasses the current state-of-the-art methods. On the SIMS dataset, our method achieves performance comparable to that of human-annotated unimodal labels. The full code is available at https://github.com/thuiar/Self-MM.
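
Editor's note: the abstract describes joint training of one multimodal task with auto-generated unimodal subtasks, weighted so that subtasks focus on samples whose unimodal supervision diverges from the multimodal label. The sketch below is an illustrative reading of that idea, not the authors' exact Self-MM formulation; the function name, L1 losses, and tanh-based weighting are assumptions chosen for clarity (see the official repository above for the actual implementation).

```python
# Illustrative sketch (not the exact Self-MM loss): combine a multimodal
# regression task with text/audio/vision subtasks whose per-sample weight
# grows with the gap between the auto-generated unimodal label and the
# human-annotated multimodal label.
import torch
import torch.nn.functional as F


def self_supervised_multitask_loss(pred_m, label_m, unimodal_preds, unimodal_labels):
    """pred_m, label_m: (B,) multimodal predictions and human labels.
    unimodal_preds, unimodal_labels: dicts keyed by 't', 'a', 'v' with (B,)
    tensors; the labels stand in for the self-supervised label generation module."""
    loss = F.l1_loss(pred_m, label_m)  # multimodal task
    for mod in ("t", "a", "v"):
        per_sample = F.l1_loss(
            unimodal_preds[mod], unimodal_labels[mod], reduction="none"
        )  # unimodal subtask, kept per-sample so it can be reweighted
        # larger unimodal-vs-multimodal disagreement -> larger weight
        weight = torch.tanh(torch.abs(unimodal_labels[mod] - label_m))
        loss = loss + (weight * per_sample).mean()
    return loss


if __name__ == "__main__":
    # Toy usage with random tensors in place of model outputs and generated labels.
    B = 4
    label_m = torch.randn(B)
    pred_m = torch.randn(B)
    uni_preds = {k: torch.randn(B) for k in ("t", "a", "v")}
    uni_labels = {k: torch.randn(B) for k in ("t", "a", "v")}
    print(self_supervised_multitask_loss(pred_m, label_m, uni_preds, uni_labels).item())
```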

Published

2021-05-18

How to Cite

Yu, W., Xu, H., Yuan, Z., & Wu, J. (2021). Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis. Proceedings of the AAAI Conference on Artificial Intelligence, 35(12), 10790-10797. https://doi.org/10.1609/aaai.v35i12.17289

Section

AAAI Technical Track on Machine Learning V