Endowing Vision-Language Models with System 2 Thinking for Fine-grained Visual Recognition

Authors

  • Yutong Yang, Sichuan University
  • Lifu Huang, Nanyang Technological University
  • Yijie Lin, Sichuan University
  • Xi Peng, Sichuan University; National Key Laboratory of Fundamental Algorithms and Models for Engineering Numerical Simulation, Sichuan University
  • Mouxing Yang, Sichuan University

DOI:

https://doi.org/10.1609/aaai.v40i14.38166

Abstract

Vision-Language Models (VLMs) excel at extracting salient visual features from query images and therefore show promising visual recognition performance. However, VLMs degrade significantly in fine-grained scenarios because they struggle to distinguish nuanced differences among candidate categories. As a remedy, we draw inspiration from the "System 1 & System 2" cognitive theory of humans, paving the way toward fine-grained recognition for VLMs. Specifically, we observe that VLMs naturally align with System 1: they quickly identify candidate categories but leave easily confused ones unresolved. Based on this observation, we propose System-2 enhanCed visuAl recogNition (SCAN), a novel plug-and-play approach that makes VLMs aware of nuanced differences. In brief, SCAN first uses off-the-shelf large foundation models to specify the discriminative attributes of the confused candidate categories and to abstract those attributes from the query images. SCAN then adaptively integrates the salient visual features from System 1 with the nuanced differences derived from System 2, resolving confusion among candidates with estimated uncertainty. Extensive experiments on eight widely used fine-grained recognition benchmarks against ten state-of-the-art baselines verify the effectiveness and superiority of SCAN.
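
The abstract does not give SCAN's formulas, so the following is only an illustrative sketch of what uncertainty-weighted integration of System 1 and System 2 scores could look like. Everything in it is an assumption made for illustration, not the authors' implementation: the function names, the use of normalized entropy as the uncertainty estimate, the linear mixing rule, and the `top_k` parameter are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(n), so the result lies in [0, 1]."""
    n = len(probs)
    if n < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(n)


def fuse_scores(system1_logits, system2_scores, top_k=3):
    """Re-rank the top-k System 1 candidates using System 2 attribute scores.

    system1_logits: dict of category -> coarse image-text logit (hypothetical input).
    system2_scores: dict of category -> attribute-match score in [0, 1]
                    (hypothetical input, e.g. produced by a foundation model).
    Returns (predicted category, fused scores, System 2 weight).
    """
    # System 1: keep only the most plausible candidates.
    candidates = sorted(system1_logits, key=system1_logits.get, reverse=True)[:top_k]
    p1 = softmax([system1_logits[c] for c in candidates])

    # Assumed uncertainty estimate: high entropy means System 1 is confused,
    # so the attribute evidence from System 2 receives more weight.
    weight = normalized_entropy(p1)

    # System 2: normalize attribute scores over the same candidates.
    s2 = [system2_scores.get(c, 0.0) for c in candidates]
    z = sum(s2)
    p2 = [s / z for s in s2] if z > 0 else p1

    fused = {c: (1 - weight) * a + weight * b for c, a, b in zip(candidates, p1, p2)}
    return max(fused, key=fused.get), fused, weight
```

In this sketch, a query whose System 1 scores are nearly tied across two categories is decided almost entirely by the attribute scores, while a query with one dominant System 1 score keeps its original prediction.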

Published

2026-03-14

How to Cite

Yang, Y., Huang, L., Lin, Y., Peng, X., & Yang, M. (2026). Endowing Vision-Language Models with System 2 Thinking for Fine-grained Visual Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 40(14), 11802-11810. https://doi.org/10.1609/aaai.v40i14.38166

Section

AAAI Technical Track on Computer Vision XI