Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis
DOI:
https://doi.org/10.1609/aaai.v35i12.17289
Keywords:
Multimodal Learning, Language Grounding & Multi-modal NLP, Text Classification & Sentiment Analysis
Abstract
Representation learning is a significant and challenging task in multimodal learning. Effective modality representations should contain two aspects of characteristics: consistency and difference. Due to the unified multimodal annotation, existing methods are restricted in capturing differentiated information. However, obtaining additional unimodal annotations is time-consuming and labor-intensive. In this paper, we design a label generation module based on a self-supervised learning strategy to acquire independent unimodal supervisions. We then jointly train the multimodal and unimodal tasks to learn consistency and difference, respectively. Moreover, during the training stage, we design a weight-adjustment strategy to balance the learning progress among the different subtasks, guiding them to focus on samples with larger differences between modality supervisions. Finally, we conduct extensive experiments on three public multimodal baseline datasets. The experimental results validate the reliability and stability of the auto-generated unimodal supervisions. On the MOSI and MOSEI datasets, our method surpasses the current state-of-the-art methods. On the SIMS dataset, our method achieves performance comparable to that obtained with human-annotated unimodal labels. The full code is available at https://github.com/thuiar/Self-MM.
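To make the abstract's weighted multi-task idea concrete, here is a minimal sketch of a joint objective in which each unimodal subtask is weighted per sample by the gap between its auto-generated label and the human-annotated multimodal label. This is not the authors' implementation (see the linked Self-MM repository for the actual code); the function names, tensor shapes, and the tanh-based weighting are illustrative assumptions.

```python
# Minimal sketch of a weighted multi-task loss for multimodal sentiment regression.
# Assumptions: label_m is human-annotated; label_t/a/v are auto-generated unimodal
# labels; all predictions and labels are 1-D tensors of shape (batch_size,).
import torch
import torch.nn.functional as F


def multitask_loss(pred_m, pred_t, pred_a, pred_v,
                   label_m, label_t, label_a, label_v):
    # Multimodal task: plain L1 loss against the human-annotated label.
    loss = F.l1_loss(pred_m, label_m)
    # Unimodal subtasks: weight each sample by how far its auto-generated label
    # deviates from the multimodal label, so the subtasks focus on samples with
    # larger differences between modality supervisions.
    for pred_u, label_u in ((pred_t, label_t), (pred_a, label_a), (pred_v, label_v)):
        w = torch.tanh(torch.abs(label_u - label_m))
        loss = loss + (w * torch.abs(pred_u - label_u)).mean()
    return loss


if __name__ == "__main__":
    # Toy usage with random predictions and labels for a batch of 4 samples.
    batch = 4
    preds = [torch.randn(batch) for _ in range(4)]
    labels = [torch.randn(batch) for _ in range(4)]
    print(multitask_loss(*preds, *labels).item())
```

The per-sample weight vanishes when a unimodal label agrees with the multimodal one, so only samples carrying modality-specific information drive the unimodal subtasks; this mirrors the weight-adjustment intent described in the abstract, though the exact weighting function here is an assumption.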
Published
2021-05-18
How to Cite
Yu, W., Xu, H., Yuan, Z., & Wu, J. (2021). Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis. Proceedings of the AAAI Conference on Artificial Intelligence, 35(12), 10790-10797. https://doi.org/10.1609/aaai.v35i12.17289
Issue
Section
AAAI Technical Track on Machine Learning V