TIM++: Transductive Information Maximization for Few-Shot CLIP

Authors

  • Yingping Li, Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi’an 710071, China
  • Yutong Zou, Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi’an 710071, China
  • Yunshi Huang, Shanghai Academy of Artificial Intelligence for Science, Shanghai 200003, China
  • Changzhe Jiao, Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi’an 710071, China
  • Xinlin Wang, Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi’an 710071, China
  • Shen Peng, School of Mathematics and Statistics, Xidian University, Xi’an 710071, China
  • Zhang Guo, Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi’an 710071, China
  • Shuiping Gou, Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi’an 710071, China

DOI

https://doi.org/10.1609/aaai.v40i8.37598

Abstract

Transductive Information Maximization (TIM) is a leading transductive few-shot learning method that maximizes the mutual information between query features and their predicted labels while incorporating supervision from the support set. However, its potential remains underexplored, largely because it makes limited use of the textual knowledge provided by vision-language models (VLMs) such as CLIP. To address this, we propose TIM++, an enhanced framework that incorporates both visual and textual information for few-shot CLIP adaptation. Specifically, TIM++ introduces a Kullback-Leibler (KL) divergence-based regularization term that encourages the model’s posterior predictions to align with CLIP’s zero-shot output distribution, focusing in particular on the most confident predictions. In addition, we develop an improved prototype initialization strategy that leverages both support and query features enriched with CLIP-guided semantics. Extensive experiments on 11 public datasets demonstrate that TIM++ consistently outperforms standard TIM, achieving average accuracy gains of 19.25% and 10.88% in the 1-shot and 2-shot settings, respectively. TIM++ also surpasses existing state-of-the-art methods, setting a new benchmark for few-shot learning with VLMs.
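To make the abstract's objective concrete, the PyTorch sketch below shows one plausible form of a TIM++-style loss: support-set cross-entropy, the TIM mutual-information term on query predictions, and a KL term pulling confident query posteriors toward CLIP's zero-shot distribution. This is our own illustrative code, not the authors' released implementation; the names tim_pp_loss, conf_threshold, lambda_mi, and lambda_kl, the confidence criterion, and the direction of the KL term are all assumptions.

    import torch
    import torch.nn.functional as F

    def tim_pp_loss(support_logits, support_labels, query_logits,
                    clip_zeroshot_probs, conf_threshold=0.9,
                    lambda_mi=1.0, lambda_kl=1.0):
        # (i) Supervised cross-entropy on the labeled support set.
        ce = F.cross_entropy(support_logits, support_labels)

        # (ii) TIM mutual-information term on the query set,
        # I(X; Y) = H(Y) - H(Y|X): a diverse label marginal
        # together with confident per-sample posteriors.
        q = query_logits.softmax(dim=-1)                # [Nq, C]
        marginal = q.mean(dim=0)                        # [C]
        h_y = -(marginal * marginal.clamp_min(1e-12).log()).sum()
        h_y_given_x = -(q * q.clamp_min(1e-12).log()).sum(-1).mean()
        mutual_info = h_y - h_y_given_x

        # (iii) KL regularization toward CLIP's zero-shot distribution,
        # applied only where zero-shot CLIP is most confident
        # (this confidence criterion is an assumption).
        mask = clip_zeroshot_probs.max(dim=-1).values >= conf_threshold
        if mask.any():
            # F.kl_div(log_probs, probs) computes KL(probs || exp(log_probs)),
            # so this is KL(model posterior || CLIP zero-shot).
            kl = F.kl_div(clip_zeroshot_probs[mask].clamp_min(1e-12).log(),
                          q[mask], reduction='batchmean')
        else:
            kl = query_logits.new_zeros(())

        # Minimize CE and KL while maximizing mutual information.
        return ce - lambda_mi * mutual_info + lambda_kl * kl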

Published

2026-03-14

How to Cite

Li, Y., Zou, Y., Huang, Y., Jiao, C., Wang, X., Peng, S., Guo, Z., & Gou, S. (2026). TIM++: Transductive Information Maximization for Few-Shot CLIP. Proceedings of the AAAI Conference on Artificial Intelligence, 40(8), 6671-6680. https://doi.org/10.1609/aaai.v40i8.37598

Section

AAAI Technical Track on Computer Vision V