S³-MSD: Large Vision-Language Model for Explainable and Generalizable Multi-modal Sarcasm Detection

Authors

  • Zhihong Zhu, Tencent Jarvis Lab
  • Fan Zhang, The Chinese University of Hong Kong
  • Yunyan Zhang, Tencent Jarvis Lab
  • Jinghan Sun, Tencent Jarvis Lab
  • Guimin Hu, University of Copenhagen
  • Hao Wu, Tencent Jarvis Lab
  • Yuyan Chen, Cornell University
  • Bowen Xing, University of Science and Technology Beijing
  • Xian Wu, Tencent Jarvis Lab

DOI:

https://doi.org/10.1609/aaai.v40i41.40834

Abstract

Multimodal sarcasm detection (MSD) aims to identify sarcasm polarity from diverse modalities (i.e., image–text pairs), a task that has received increasing attention. While significant progress has been made, existing approaches still face two major issues: lack of explainability and weak generalizability. In this paper, we introduce a new large vision–language model (LVLM) dubbed S³-MSD for explainable and generalizable MSD through three key components. For explainability, we develop (1) a self-training paradigm that automatically bootstraps answers with explanations, and (2) a self-calibrating mechanism that rectifies flawed explanations. For generalizability, we design (3) a self-focusing module that amplifies visual semantic entities through preference optimization, thereby mitigating textual over-reliance. Experimental results on both in-distribution and out-of-distribution (OOD) benchmarks demonstrate that S³-MSD consistently outperforms state-of-the-art methods in detection performance. Furthermore, the proposed S³-MSD provides persuasive explanations, as verified by both quantitative metrics and human evaluations.

Published

2026-03-14

How to Cite

Zhu, Z., Zhang, F., Zhang, Y., Sun, J., Hu, G., Wu, H., … Wu, X. (2026). S³-MSD: Large Vision-Language Model for Explainable and Generalizable Multi-modal Sarcasm Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 40(41), 35266–35274. https://doi.org/10.1609/aaai.v40i41.40834

Section

AAAI Technical Track on Natural Language Processing VI