Inverse Optimal Transport for Efficient Adaptation of Vision-Language Models

Authors

  • Shupeng Qiu, Sun Yat-sen University
  • Chuan-Xian Ren, Sun Yat-sen University

DOI:

https://doi.org/10.1609/aaai.v40i30.39690

Abstract

Vision–language models (VLMs) such as CLIP have unlocked powerful zero-shot transfer, yet efficient adaptation to downstream tasks remains challenging. Existing methods often depend on graph structures and dataset-specific tuning, making them sensitive to modality gaps and computationally costly at scale. In this paper, we propose IOTA (Inverse Optimal Transport Adaptation), a lightweight algorithm that reformulates VLM inference from the perspective of inverse optimal transport (IOT), providing a unified view of training and inference. Under the IOT framework, IOTA enhances zero-shot alignment via a theory-guided unbalanced OT strategy and refines textual prototypes using OT-based pseudo-labels with a marginal-aware adaptive threshold, enabling reliable supervision without gradient updates. The framework naturally extends to few-shot scenarios through a label-guided masking mechanism. By decoupling image–text interactions from other inter-modal dependencies, IOTA avoids task-specific tuning and expensive affinity construction. Extensive experiments on standard benchmarks show that IOTA consistently improves zero-shot and few-shot performance while reducing memory and computation overhead, validating both its theoretical insight and plug-and-play practicality.
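
As a rough illustration of OT-based pseudo-labeling, the sketch below computes an entropic optimal transport plan between a few images and class prompts from their similarity scores, then reads a pseudo-label from each row of the plan. It is not the IOTA algorithm: it uses plain balanced Sinkhorn iterations with uniform marginals as a stand-in for the unbalanced strategy and the marginal-aware threshold described above. The function names, the similarity values, and the parameters eps and n_iters are all hypothetical choices made for this example.

```python
import math


def sinkhorn_plan(sim, eps=0.1, n_iters=200):
    """Entropic OT plan between n images and k class prompts.

    sim[i][j] is an image-text similarity; the cost is taken as -sim.
    Uniform marginals are an assumption of this sketch.
    """
    n, k = len(sim), len(sim[0])
    a = [1.0 / n] * n  # mass assigned to each image
    b = [1.0 / k] * k  # mass assigned to each class
    # Gibbs kernel exp(-cost / eps) = exp(sim / eps)
    K = [[math.exp(s / eps) for s in row] for row in sim]
    u = [1.0] * n
    v = [1.0] * k
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals
        u = [a[i] / sum(K[i][j] * v[j] for j in range(k)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(k)]
    return [[u[i] * K[i][j] * v[j] for j in range(k)] for i in range(n)]


def pseudo_labels(plan):
    """Pick, for each image, the class receiving the most transported mass."""
    return [max(range(len(row)), key=row.__getitem__) for row in plan]


# Toy similarities: 4 images, 2 classes (made-up values)
sim = [
    [0.9, 0.1],
    [0.8, 0.3],
    [0.2, 0.7],
    [0.1, 0.9],
]
plan = sinkhorn_plan(sim)
labels = pseudo_labels(plan)
```

In this toy setting the column sums of the plan match the uniform class marginal, which is the property that distinguishes OT assignment from an independent per-image softmax. A confidence threshold on the plan entries would be a natural next step, but its form here would be a guess.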

Published

2026-03-14

How to Cite

Qiu, S., & Ren, C.-X. (2026). Inverse Optimal Transport for Efficient Adaptation of Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(30), 25018–25026. https://doi.org/10.1609/aaai.v40i30.39690

Issue

Section

AAAI Technical Track on Machine Learning VII