Text and Image Are Mutually Beneficial: Enhancing Training-Free Few-Shot Classification with CLIP

Authors

  • Yayuan Li National Key Laboratory for Novel Software Technology, Nanjing University, China
  • Jintao Guo National Key Laboratory for Novel Software Technology, Nanjing University, China
  • Lei Qi School of Computer Science and Engineering, Southeast University, China
  • Wenbin Li National Key Laboratory for Novel Software Technology, Nanjing University, China
  • Yinghuan Shi National Key Laboratory for Novel Software Technology, Nanjing University, China

DOI:

https://doi.org/10.1609/aaai.v39i5.32534

Abstract

Contrastive Language-Image Pretraining (CLIP) has been widely used in vision tasks. Notably, CLIP has demonstrated promising performance in few-shot learning (FSL). However, existing CLIP-based methods in training-free FSL (i.e., without the requirement of additional training) mainly learn the different modalities independently, leading to two essential issues: 1) severe anomalous matches in the image modality; 2) varying quality of generated text prompts. To address these issues, we build a mutual guidance mechanism that introduces an Image-Guided-Text (IGT) component to rectify the varying quality of text prompts through image representations, and a Text-Guided-Image (TGI) component to mitigate anomalous matches in the image modality through text representations. By integrating IGT and TGI, we adopt the perspective of Text-Image Mutual guidance Optimization and propose TIMO. Extensive experiments show that TIMO significantly outperforms the state-of-the-art (SOTA) training-free method. Additionally, by exploring the extent of mutual guidance, we propose an enhanced variant, TIMO-S, which even surpasses the best training-required methods by 0.33% at roughly 100× lower time cost.
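For context, the training-free CLIP few-shot setting the abstract refers to can be illustrated with a generic cache-based baseline (in the style of Tip-Adapter, which TIMO-like methods build on). This is a minimal sketch with synthetic features, not the paper's TIMO method; the function name, hyperparameters `alpha`/`beta`, and the toy data are all illustrative assumptions.

```python
import numpy as np

def normalize(x):
    # L2-normalize feature vectors, as CLIP embeddings are compared by cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def training_free_fewshot_logits(image_feat, text_feats, cache_keys, cache_labels,
                                 alpha=1.0, beta=5.5):
    """Generic training-free few-shot classification (Tip-Adapter-style sketch).

    Combines a zero-shot text branch with a cache of few-shot image features;
    no gradient updates are required.
    """
    # Zero-shot branch: similarity between the query image and class text embeddings.
    zs_logits = image_feat @ text_feats.T
    # Cache branch: affinity to the stored few-shot image features,
    # converted to class logits via their one-hot labels.
    affinity = image_feat @ cache_keys.T
    cache_logits = np.exp(-beta * (1.0 - affinity)) @ cache_labels
    return zs_logits + alpha * cache_logits

# Toy example with synthetic "CLIP" features (3 classes, 4 shots, 32-dim).
rng = np.random.default_rng(0)
num_classes, shots, dim = 3, 4, 32
text_feats = normalize(rng.normal(size=(num_classes, dim)))
cache_keys = normalize(text_feats.repeat(shots, axis=0)
                       + 0.1 * rng.normal(size=(num_classes * shots, dim)))
cache_labels = np.eye(num_classes).repeat(shots, axis=0)
# A query image feature close to class 1.
query = normalize(text_feats[1] + 0.05 * rng.normal(size=dim))
logits = training_free_fewshot_logits(query, text_feats, cache_keys, cache_labels)
```

The two issues the abstract raises map onto this sketch: an "anomalous match" is a query that spuriously matches a wrong-class cache key, and "varying quality of text prompts" degrades `text_feats`; TIMO's IGT/TGI components (not shown here) use each modality to correct the other.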

Published

2025-04-11

How to Cite

Li, Y., Guo, J., Qi, L., Li, W., & Shi, Y. (2025). Text and Image Are Mutually Beneficial: Enhancing Training-Free Few-Shot Classification with CLIP. Proceedings of the AAAI Conference on Artificial Intelligence, 39(5), 5039–5047. https://doi.org/10.1609/aaai.v39i5.32534

Section

AAAI Technical Track on Computer Vision IV