Beyond Counting: Evaluating Abstract and Emotional Reasoning in Vision-Language Models

Authors

  • Yuan Zhou, MAIS, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, UCAS
  • Yan Zhang, Department of Computer Science and Technology, Tsinghua University
  • Jianlong Chang, Huawei
  • Xin Gu, Research and Development Department, China Academy of Launch Vehicle Technology
  • Ying Wang, MAIS, Institute of Automation, Chinese Academy of Sciences
  • Kun Ding, MAIS, Institute of Automation, Chinese Academy of Sciences
  • Guangwen Yang, Department of Computer Science and Technology, Tsinghua University
  • Shiming Xiang, MAIS, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, UCAS

DOI:

https://doi.org/10.1609/aaai.v40i16.38389

Abstract

Despite the rapid progress of Vision-Language Models (VLMs), existing benchmarks still concentrate on coarse-grained object recognition or simple relational reasoning, leaving the fine-grained and higher-order reasoning abilities of these systems largely unexamined. To bridge this critical evaluation gap, we introduce EmojiGrid, a novel diagnostic benchmark specifically designed to probe these fine-grained and higher-order skills. Leveraging the universal and semantically rich nature of emojis, we synthesize a grid-based visual dataset paired with more than 29,000 question-answer (QA) pairs. Each pair is explicitly anchored in a three-level cognitive taxonomy comprising (i) Perception and Information Extraction, (ii) Relational and Structural Reasoning, and (iii) Abstraction and Advanced Cognition. These dimensions further decompose into nine categories covering a broad range of cognitive skills, including counting, spatial relations, compositional logic, semantic sentiment, and related higher-order reasoning tasks. Our extensive evaluation of 25 state-of-the-art open-source and proprietary VLMs reveals a significant performance gap between foundational perceptual tasks and higher-level cognitive abilities, particularly in abstraction and advanced emotional reasoning. Notably, all models struggle with compositional logic, spatial consistency, and especially emotional and semantic understanding. EmojiGrid provides a quantifiable, fine-grained benchmark to diagnose VLM limitations and guides future progress toward models that can truly perceive, reason about, and interpret complex, symbol-rich visual scenes.
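To make the taxonomy-anchored evaluation concrete, the sketch below shows one way a QA pair tagged with a cognitive level and category might be represented and scored per level. This is a minimal illustration, not the benchmark's actual schema or protocol: the `QAPair` fields, level names, example questions, and the exact-match scoring rule are all assumptions for the sake of the example.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class QAPair:
    # Hypothetical representation of one EmojiGrid QA pair;
    # field names and level labels are illustrative only.
    question: str
    answer: str
    level: str      # e.g. "Perception", "Relational", "Abstraction"
    category: str   # one of the nine fine-grained categories

def per_level_accuracy(pairs, predictions):
    """Aggregate case-insensitive exact-match accuracy per cognitive level."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pair, pred in zip(pairs, predictions):
        total[pair.level] += 1
        if pred.strip().lower() == pair.answer.strip().lower():
            correct[pair.level] += 1
    return {lvl: correct[lvl] / total[lvl] for lvl in total}

# Toy examples in the spirit of the three taxonomy levels.
pairs = [
    QAPair("How many apple emojis are in the grid?", "3", "Perception", "counting"),
    QAPair("Which emoji is left of the dog?", "apple", "Relational", "spatial relations"),
    QAPair("What mood does the grid convey?", "joy", "Abstraction", "semantic sentiment"),
]
preds = ["3", "cat", "joy"]
print(per_level_accuracy(pairs, preds))
# {'Perception': 1.0, 'Relational': 0.0, 'Abstraction': 1.0}
```

Reporting accuracy per level rather than as a single aggregate is what exposes the perception-versus-abstraction gap the abstract describes.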

Published

2026-03-14

How to Cite

Zhou, Y., Zhang, Y., Chang, J., Gu, X., Wang, Y., Ding, K., … Xiang, S. (2026). Beyond Counting: Evaluating Abstract and Emotional Reasoning in Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13809–13817. https://doi.org/10.1609/aaai.v40i16.38389

Section

AAAI Technical Track on Computer Vision XIII