S³-MSD: Large Vision-Language Model for Explainable and Generalizable Multi-modal Sarcasm Detection
DOI:
https://doi.org/10.1609/aaai.v40i41.40834
Abstract
Multimodal sarcasm detection (MSD) aims to identify sarcasm polarity from diverse modalities (i.e., image–text pairs), a task that has received increasing attention. While significant progress has been made, existing approaches still face two major issues: lack of explainability and weak generalizability. In this paper, we introduce a new large vision–language model (LVLM) dubbed S³-MSD for explainable and generalizable MSD through three key components. For explainability, we develop (1) a self-training paradigm that automatically bootstraps answers with explanations, and (2) a self-calibrating mechanism that rectifies flawed explanations. For generalizability, we design (3) a self-focusing module that amplifies visual semantic entities through preference optimization, thereby mitigating textual over-reliance. Experimental results on both in-distribution and out-of-distribution (OOD) benchmarks demonstrate that S³-MSD consistently outperforms state-of-the-art methods in detection performance. Furthermore, the proposed S³-MSD provides persuasive explanations, as verified by both quantitative metrics and human evaluations.
Published
2026-03-14
How to Cite
Zhu, Z., Zhang, F., Zhang, Y., Sun, J., Hu, G., Wu, H., … Wu, X. (2026). S³-MSD: Large Vision-Language Model for Explainable and Generalizable Multi-modal Sarcasm Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 40(41), 35266–35274. https://doi.org/10.1609/aaai.v40i41.40834
Issue
Section
AAAI Technical Track on Natural Language Processing VI