MASP: Multi-Aspect Guided Emotion Reasoning with Soft Prompt Tuning In Vision-Language Models

Authors

  • SangEun Lee, Electronics and Telecommunications Research Institute
  • Yubeen Lee, Sungkyunkwan University
  • Eunil Park, Sungkyunkwan University
  • Wonseok Chae, Electronics and Telecommunications Research Institute

DOI:

https://doi.org/10.1609/aaai.v40i3.37168

Abstract

Understanding human emotions from images is a challenging yet essential task for vision-language models. While recent efforts have fine-tuned vision-language models to enhance emotional awareness, most approaches rely on global visual representations and fail to capture the nuanced and multi-faceted nature of emotional cues. Furthermore, most existing approaches adopt instruction tuning, which requires costly dataset construction and involves training a large number of parameters, thereby limiting their scalability and efficiency. To address these challenges, we propose MASP, a novel framework for Multi-Aspect guided emotion reasoning with Soft Prompt tuning in vision-language models. MASP explicitly separates emotion-relevant visual cues via multi-aspect cross-attention modules and guides the language model using soft prompts, enabling efficient and scalable task adaptation without modifying the base model. Our method achieves state-of-the-art performance on various emotion recognition benchmarks, demonstrating that the explicit modeling of multi-aspect emotional cues with soft prompt tuning leads to more accurate and interpretable emotion reasoning in vision-language models.
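The abstract names two mechanisms: aspect-specific cross-attention over visual features, and soft prompts that steer a frozen language model. The sketch below is one possible reading of that description, not the authors' implementation. Every name in it (`cross_attend`, `build_lm_input`, the number of aspects, the vector sizes) is hypothetical, and the attention is deliberately simplified to a single head with no learned key or value projections. In such a setup, only the aspect queries and soft prompt vectors would be trained while the base model stays untouched.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(query, patches):
    """Single-head cross-attention: one query vector attends over patch features.

    Simplification (assumption): the patches serve as both keys and values.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, p) / scale for p in patches])
    dim = len(patches[0])
    return [sum(w * p[i] for w, p in zip(weights, patches)) for i in range(dim)]


def build_lm_input(aspect_queries, soft_prompts, patches, text_embeddings):
    """Prepend soft prompts and one pooled token per aspect to the text embeddings."""
    aspect_tokens = [cross_attend(q, patches) for q in aspect_queries]
    return soft_prompts + aspect_tokens + text_embeddings


random.seed(0)
DIM = 8  # hypothetical embedding size


def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]


patches = [rand_vec() for _ in range(16)]         # stand-in for frozen vision features
aspect_queries = [rand_vec() for _ in range(3)]   # trainable; 3 aspects is an assumption
soft_prompts = [rand_vec() for _ in range(4)]     # trainable prompt vectors
text_embeddings = [rand_vec() for _ in range(5)]  # stand-in for frozen token embeddings

sequence = build_lm_input(aspect_queries, soft_prompts, patches, text_embeddings)
print(len(sequence), len(sequence[0]))  # 12 8
```

Each aspect query pools the same patch features with different attention weights, which is one way separate emotional cues could be kept apart before they reach the language model.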

Published

2026-03-14

How to Cite

Lee, S., Lee, Y., Park, E., & Chae, W. (2026). MASP: Multi-Aspect Guided Emotion Reasoning with Soft Prompt Tuning In Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(3), 1882-1890. https://doi.org/10.1609/aaai.v40i3.37168

Section

AAAI Technical Track on Cognitive Modeling & Cognitive Systems