TIM++: Transductive Information Maximization for Few-Shot CLIP

Authors

  • Yingping Li, Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi’an 710071, China
  • Yutong Zou, Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi’an 710071, China
  • Yunshi Huang, Shanghai Academy of Artificial Intelligence for Science, Shanghai 200003, China
  • Changzhe Jiao, Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi’an 710071, China
  • Xinlin Wang, Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi’an 710071, China
  • Shen Peng, School of Mathematics and Statistics, Xidian University, Xi’an 710071, China
  • Zhang Guo, Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi’an 710071, China
  • Shuiping Gou, Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi’an 710071, China

DOI

https://doi.org/10.1609/aaai.v40i8.37598

Abstract

Transductive Information Maximization (TIM) is a leading transductive few-shot learning method that maximizes the mutual information between query features and their predicted labels while incorporating supervision from the support set. However, its potential remains underexplored, largely because it makes limited use of the textual knowledge provided by vision-language models (VLMs) such as CLIP. To address this, we propose TIM++, an enhanced framework that incorporates both visual and textual information for few-shot CLIP adaptation. Specifically, TIM++ introduces a Kullback-Leibler (KL) divergence-based regularization term that encourages the model’s posterior predictions to align with CLIP’s zero-shot output distribution, focusing in particular on the most confident predictions. In addition, we develop an improved prototype initialization strategy that leverages both support and query features enriched with CLIP-guided semantics. Extensive experiments on 11 public datasets demonstrate that TIM++ consistently outperforms standard TIM, achieving average accuracy gains of 19.25% and 10.88% in the 1-shot and 2-shot settings, respectively. TIM++ also surpasses existing state-of-the-art methods, setting a new benchmark for few-shot learning with VLMs.
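To make the abstract's objective concrete, the PyTorch sketch below shows one plausible form of a TIM++-style loss: support-set cross-entropy, the TIM mutual-information term on query predictions, and a KL term pulling confident query posteriors toward CLIP's zero-shot distribution. This is our own illustrative code, not the authors' released implementation; the names tim_pp_loss, conf_threshold, lambda_mi, and lambda_kl, the confidence criterion, and the direction of the KL term are all assumptions.

    import torch
    import torch.nn.functional as F

    def tim_pp_loss(support_logits, support_labels, query_logits,
                    clip_zeroshot_probs, conf_threshold=0.9,
                    lambda_mi=1.0, lambda_kl=1.0):
        # (i) Supervised cross-entropy on the labeled support set.
        ce = F.cross_entropy(support_logits, support_labels)

        # (ii) TIM mutual-information term on the query set,
        # I(X; Y) = H(Y) - H(Y|X): a diverse label marginal
        # together with confident per-sample posteriors.
        q = query_logits.softmax(dim=-1)                # [Nq, C]
        marginal = q.mean(dim=0)                        # [C]
        h_y = -(marginal * marginal.clamp_min(1e-12).log()).sum()
        h_y_given_x = -(q * q.clamp_min(1e-12).log()).sum(-1).mean()
        mutual_info = h_y - h_y_given_x

        # (iii) KL regularization toward CLIP's zero-shot distribution,
        # applied only where zero-shot CLIP is most confident
        # (this confidence criterion is an assumption).
        mask = clip_zeroshot_probs.max(dim=-1).values >= conf_threshold
        if mask.any():
            # F.kl_div(log_probs, probs) computes KL(probs || exp(log_probs)),
            # so this is KL(model posterior || CLIP zero-shot).
            kl = F.kl_div(clip_zeroshot_probs[mask].clamp_min(1e-12).log(),
                          q[mask], reduction='batchmean')
        else:
            kl = query_logits.new_zeros(())

        # Minimize CE and KL while maximizing mutual information.
        return ce - lambda_mi * mutual_info + lambda_kl * kl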

Published

2026-03-14

How to Cite

Li, Y., Zou, Y., Huang, Y., Jiao, C., Wang, X., Peng, S., Guo, Z., & Gou, S. (2026). TIM++: Transductive Information Maximization for Few-Shot CLIP. Proceedings of the AAAI Conference on Artificial Intelligence, 40(8), 6671-6680. https://doi.org/10.1609/aaai.v40i8.37598

Section

AAAI Technical Track on Computer Vision V