Endowing Vision-Language Models with System 2 Thinking for Fine-grained Visual Recognition
DOI:
https://doi.org/10.1609/aaai.v40i14.38166
Abstract
Vision-Language Models (VLMs) excel at extracting salient visual features from query images, thus exhibiting promising visual recognition performance. However, VLMs degrade significantly in fine-grained scenarios because they struggle to distinguish nuanced differences among candidate categories. As a remedy, we draw inspiration from the "System 1 & System 2" cognitive theory of humans, paving the way to fine-grained recognition for VLMs. Specifically, we observe that VLMs naturally align with System 1, quickly identifying candidate categories but leaving easily confused ones unresolved. Based on this observation, we propose System-2 enhanCed visuAl recogNition (SCAN), a novel plug-and-play approach that makes VLMs aware of nuanced differences. In brief, SCAN first specifies and abstracts the discriminative attributes for the confused candidate categories and query images by resorting to off-the-shelf large foundation models, respectively. After that, SCAN adaptively integrates the salient visual features from System 1 with the nuanced differences derived from System 2, resolving confusion among candidates with estimated uncertainty. Extensive experiments on eight widely used fine-grained recognition benchmarks against 10 state-of-the-art baselines verify the effectiveness and superiority of SCAN.
Published
2026-03-14
How to Cite
Yang, Y., Huang, L., Lin, Y., Peng, X., & Yang, M. (2026). Endowing Vision-Language Models with System 2 Thinking for Fine-grained Visual Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 40(14), 11802-11810. https://doi.org/10.1609/aaai.v40i14.38166
Issue
Section
AAAI Technical Track on Computer Vision XI