Endowing Vision-Language Models with System 2 Thinking for Fine-grained Visual Recognition

Yutong Yang; Lifu Huang; Yijie Lin; Xi Peng; Mouxing Yang

doi:10.1609/aaai.v40i14.38166

Authors

Yutong Yang Sichuan University
Lifu Huang Nanyang Technological University
Yijie Lin Sichuan University
Xi Peng Sichuan University, National Key Laboratory of Fundamental Algorithms and Models for Engineering Numerical Simulation, Sichuan University
Mouxing Yang Sichuan University

DOI:

https://doi.org/10.1609/aaai.v40i14.38166

Abstract

Vision-Language Models (VLMs) excel at extracting salient visual features from query images and therefore show promising visual recognition performance. However, their performance degrades significantly in fine-grained scenarios because they struggle to distinguish nuanced differences among candidate categories. As a remedy, we draw inspiration from the "System 1 & System 2" theory of human cognition to enable fine-grained recognition with VLMs. Specifically, we observe that VLMs naturally align with System 1: they quickly identify candidate categories but leave easily confused ones unresolved. Based on this observation, we propose System-2 enhanCed visuAl recogNition (SCAN), a novel plug-and-play approach that makes VLMs aware of nuanced differences. In brief, SCAN first resorts to off-the-shelf large foundation models to specify the discriminative attributes of the confused candidate categories and to abstract those of the query images. SCAN then adaptively integrates the salient visual features from System 1 with the nuanced differences derived from System 2, resolving confusion among candidates with estimated uncertainty. Extensive experiments on eight widely used fine-grained recognition benchmarks against ten state-of-the-art baselines verify the effectiveness and superiority of SCAN.
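
The abstract does not say how SCAN estimates uncertainty or how it combines the two signals, so the following is only an illustrative sketch of one way an uncertainty-gated fusion could look, not the authors' implementation. It assumes that System 1 supplies one similarity logit per category, that System 2 supplies an attribute-match score in [0, 1] for each confused candidate, and that uncertainty is measured as the normalized entropy of the System 1 distribution over the top-k candidates. The names `fuse`, `system2_scores`, and `top_k` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(n), so the result lies in [0, 1]."""
    n = len(probs)
    if n < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(n)


def fuse(system1_logits, system2_scores, top_k=3):
    """Blend System 1 and System 2 evidence over the top-k candidates.

    system1_logits: one similarity logit per category (assumed VLM output).
    system2_scores: dict {category index: attribute-match score in [0, 1]}
                    (assumed to come from a separate attribute comparison).
    Returns (predicted index, fused score per candidate, uncertainty).
    """
    probs = softmax(system1_logits)

    # System 1: keep the k most likely categories as candidates.
    candidates = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in candidates)
    cand_probs = [probs[i] / mass for i in candidates]

    # Assumed uncertainty estimate: how flat System 1 is over the candidates.
    u = normalized_entropy(cand_probs)

    # The more uncertain System 1 is, the more weight System 2 receives.
    fused = {
        i: (1.0 - u) * p + u * system2_scores.get(i, 0.0)
        for i, p in zip(candidates, cand_probs)
    }
    pred = max(fused, key=fused.get)
    return pred, fused, u
```

With this weighting, a confident System 1 prediction passes through almost unchanged, while a tie between two candidates is decided by the System 2 scores alone. Other uncertainty measures or fusion rules would fit the abstract's description equally well.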
