OneLIP: Unlocking and Improving Long-Text Representations of CLIP via One-Stage Adaptation
DOI:
https://doi.org/10.1609/aaai.v40i10.37773
Abstract
Contrastive Language-Image Pretraining (CLIP) has demonstrated impressive generalization on vision-language tasks by aligning images and short texts. However, its inherent 77-token limit constrains its capacity to capture the complex semantics of long captions. Existing long-text adaptations of CLIP typically rely on either multi-stage training or truncation-based alignment, both of which inevitably result in semantic degradation and cumbersome tuning. We therefore propose OneLIP, a unified framework that extends CLIP to understand long captions within a single training stage, eliminating the need for brittle truncation or multi-stage pipelines. OneLIP addresses semantic degradation through two key innovations: (1) a Token Refinement and Importance-guided Modeling (TRIM) module, which selects and refines informative tokens via SVD-based contribution scoring and cross-modal relevance modeling; and (2) a Per-sample Online Hard Negative Mining (PO-HNM) strategy, which dynamically maintains sample-specific negatives based on dual-consistency difficulty tracking and is especially effective in long-text scenarios where key semantics are scattered across many positions. Extensive experiments on long-text image retrieval, short-text image retrieval, zero-shot classification, and text-to-image generation demonstrate OneLIP's robustness and versatility across diverse input lengths, offering a faithful solution for long-text representation learning with CLIP.
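To make the two mechanisms concrete, the sketches below show one plausible reading of each; they are illustrations under stated assumptions, not the paper's actual formulation. In the first, the function names, the use of top-r singular-subspace energy as the score, and the top-k selection rule are all assumptions.

```python
import torch

def svd_contribution_scores(tokens: torch.Tensor, top_r: int = 4) -> torch.Tensor:
    """Score tokens by their energy in the leading singular subspace of the
    token matrix -- a hypothetical reading of TRIM's "SVD-based contribution
    scoring"; the abstract does not give the exact formulation.

    tokens: (n_tokens, dim) embeddings from the CLIP text encoder.
    Returns: (n_tokens,) scores, higher = more informative.
    """
    centered = tokens - tokens.mean(dim=0, keepdim=True)
    # Thin SVD: U is (n_tokens, r0), S is (r0,), with r0 = min(n_tokens, dim).
    U, S, _ = torch.linalg.svd(centered, full_matrices=False)
    r = min(top_r, S.numel())
    # Squared projection of each token onto the top-r singular directions.
    return (U[:, :r] * S[:r]).pow(2).sum(dim=1)

def select_informative_tokens(tokens: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k highest-scoring tokens for downstream refinement."""
    idx = svd_contribution_scores(tokens).topk(k).indices
    return tokens[idx]
```

Likewise, a minimal sketch of per-sample online hard negative mining: each anchor keeps its own small pool of the hardest negatives seen so far, refreshed at every step. Since the abstract does not define "dual-consistency difficulty tracking", difficulty here is approximated by cosine similarity alone; PerSampleNegativePool and its update rule are illustrative.

```python
import torch

class PerSampleNegativePool:
    """Each sample id keeps its own pool of hard negatives (illustrative only)."""

    def __init__(self, pool_size: int = 8):
        self.pool_size = pool_size
        self.pools: dict[int, torch.Tensor] = {}  # id -> (pool_size, dim)

    @torch.no_grad()
    def update(self, ids, anchors, candidates, pos_idx):
        """anchors: (B, dim) image embeddings; candidates: (N, dim) text
        embeddings; pos_idx[i] is the index of anchor i's positive caption.
        Assumes all embeddings are L2-normalized, so the dot product is
        cosine similarity."""
        sims = anchors @ candidates.T                   # (B, N) similarities
        sims[torch.arange(len(ids)), pos_idx] = -1.0    # mask out positives
        hard_idx = sims.topk(self.pool_size, dim=1).indices
        for row, sid in enumerate(ids):
            self.pools[int(sid)] = candidates[hard_idx[row]]
```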
Published
2026-03-14
How to Cite
Pan, R., Song, J., & Yang, H. (2026). OneLIP: Unlocking and Improving Long-Text Representations of CLIP via One-Stage Adaptation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(10), 8251–8259. https://doi.org/10.1609/aaai.v40i10.37773
Issue
Vol. 40 No. 10 (2026)
Section
AAAI Technical Track on Computer Vision VII