S³-MSD: Large Vision-Language Model for Explainable and Generalizable Multi-modal Sarcasm Detection

Zhihong Zhu; Fan Zhang; Yunyan Zhang; Jinghan Sun; Guimin Hu; Hao Wu; Yuyan Chen; Bowen Xing; Xian Wu

doi:10.1609/aaai.v40i41.40834

Authors

Zhihong Zhu, Tencent Jarvis Lab
Fan Zhang, The Chinese University of Hong Kong
Yunyan Zhang, Tencent Jarvis Lab
Jinghan Sun, Tencent Jarvis Lab
Guimin Hu, University of Copenhagen
Hao Wu, Tencent Jarvis Lab
Yuyan Chen, Cornell University
Bowen Xing, University of Science and Technology Beijing
Xian Wu, Tencent Jarvis Lab

DOI:

https://doi.org/10.1609/aaai.v40i41.40834

Abstract

Multimodal sarcasm detection (MSD) aims to identify sarcasm polarity from diverse modalities (i.e., image–text pairs), a task that has received increasing attention. While significant progress has been made, existing approaches still face two major issues: lack of explainability and weak generalizability. In this paper, we introduce a new large vision–language model (LVLM) dubbed S³-MSD for explainable and generalizable MSD through three key components. For explainability, we develop (1) a self-training paradigm that automatically bootstraps answers with explanations, and (2) a self-calibrating mechanism that rectifies flawed explanations. For generalizability, we design (3) a self-focusing module that amplifies visual semantic entities through preference optimization, thereby mitigating textual over-reliance. Experimental results on both in-distribution and out-of-distribution (OOD) benchmarks demonstrate that S³-MSD consistently outperforms state-of-the-art methods in detection performance. Furthermore, the proposed S³-MSD provides persuasive explanations, as verified by both quantitative metrics and human evaluations.
