MASP: Multi-Aspect Guided Emotion Reasoning with Soft Prompt Tuning In Vision-Language Models
DOI: https://doi.org/10.1609/aaai.v40i3.37168

Abstract
Understanding human emotions from images is a challenging yet essential task for vision-language models. While recent efforts have fine-tuned vision-language models to enhance emotional awareness, most approaches rely on global visual representations and fail to capture the nuanced, multi-faceted nature of emotional cues. Furthermore, existing approaches typically adopt instruction tuning, which requires costly dataset construction and involves training a large number of parameters, limiting their scalability and efficiency. To address these challenges, we propose MASP, a novel framework for Multi-Aspect guided emotion reasoning with Soft Prompt tuning in vision-language models. MASP explicitly separates emotion-relevant visual cues via multi-aspect cross-attention modules and guides the language model using soft prompts, enabling efficient and scalable task adaptation without modifying the base model. Our method achieves state-of-the-art performance on various emotion recognition benchmarks, demonstrating that explicit modeling of multi-aspect emotional cues with soft prompt tuning leads to more accurate and interpretable emotion reasoning in vision-language models.
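To make the high-level idea concrete, the following is a minimal NumPy sketch of the mechanism the abstract describes: per-aspect learnable queries cross-attend over visual tokens, and the resulting vectors are concatenated into a soft prompt prepended to the language model's input. All names, aspect choices, and dimensions here are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, tokens):
    # Scaled dot-product cross-attention: queries attend over visual tokens.
    d = queries.shape[-1]
    scores = queries @ tokens.T / np.sqrt(d)   # (n_queries, n_tokens)
    weights = softmax(scores, axis=-1)
    return weights @ tokens                    # (n_queries, d)

rng = np.random.default_rng(0)
d, n_tokens, n_aspects, prompts_per_aspect = 64, 49, 3, 4

# Patch-level features from a frozen vision encoder (placeholder values).
visual_tokens = rng.normal(size=(n_tokens, d))

# One learnable query bank per aspect (e.g. facial, contextual, object cues);
# aspect count and names are hypothetical here.
aspect_queries = [rng.normal(size=(prompts_per_aspect, d))
                  for _ in range(n_aspects)]

# Each aspect extracts its own emotion-relevant cues; concatenating them
# yields the soft prompt sequence fed to the (frozen) language model.
soft_prompts = np.concatenate(
    [cross_attention(q, visual_tokens) for q in aspect_queries], axis=0
)
print(soft_prompts.shape)  # (n_aspects * prompts_per_aspect, d)
```

In this sketch only the aspect queries (and, in practice, the attention projections) would be trained, which matches the abstract's claim of task adaptation without modifying the base model.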
Published
2026-03-14
How to Cite
Lee, S., Lee, Y., Park, E., & Chae, W. (2026). MASP: Multi-Aspect Guided Emotion Reasoning with Soft Prompt Tuning In Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(3), 1882-1890. https://doi.org/10.1609/aaai.v40i3.37168
Section
AAAI Technical Track on Cognitive Modeling & Cognitive Systems