OneLIP: Unlocking and Improving Long-Text Representations of CLIP via One-Stage Adaptation
DOI:
https://doi.org/10.1609/aaai.v40i10.37773
Abstract
Contrastive Language-Image Pretraining (CLIP) has demonstrated impressive generalization on vision-language tasks by aligning images and short texts. However, its inherent 77-token limit constrains its capacity to capture the complex semantics of long captions. Existing long-text adaptations of CLIP typically rely on either multi-stage training or truncation-based alignment, both of which inevitably result in semantic degradation and cumbersome tuning. We therefore propose OneLIP, a unified framework that extends CLIP to understand long captions within a single training stage, eliminating the need for brittle truncation or multi-stage pipelines. OneLIP addresses semantic degradation through two key innovations: (1) a Token Refinement and Importance-guided Modeling (TRIM) module, which selects and refines informative tokens via SVD-based contribution scoring and cross-modal relevance modeling; and (2) a Per-sample Online Hard Negative Mining (PO-HNM) strategy, which dynamically maintains sample-specific negatives based on dual-consistency difficulty tracking and is especially effective in long-text scenarios where key semantics are scattered across many positions. Extensive experiments on long-text image retrieval, short-text image retrieval, zero-shot classification, and text-to-image generation demonstrate OneLIP's robustness and versatility across diverse input lengths, offering a faithful solution for long-text representation learning with CLIP.
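To make the two mechanisms concrete, the sketches below show one plausible reading of each; they are illustrations under stated assumptions, not the paper's actual formulation. In the first, the function names, the use of top-r singular-subspace energy as the score, and the top-k selection rule are all assumptions.

```python
import torch

def svd_contribution_scores(tokens: torch.Tensor, top_r: int = 4) -> torch.Tensor:
    """Score tokens by their energy in the leading singular subspace of the
    token matrix -- a hypothetical reading of TRIM's "SVD-based contribution
    scoring"; the abstract does not give the exact formulation.

    tokens: (n_tokens, dim) embeddings from the CLIP text encoder.
    Returns: (n_tokens,) scores, higher = more informative.
    """
    centered = tokens - tokens.mean(dim=0, keepdim=True)
    # Thin SVD: U is (n_tokens, r0), S is (r0,), with r0 = min(n_tokens, dim).
    U, S, _ = torch.linalg.svd(centered, full_matrices=False)
    r = min(top_r, S.numel())
    # Squared projection of each token onto the top-r singular directions.
    return (U[:, :r] * S[:r]).pow(2).sum(dim=1)

def select_informative_tokens(tokens: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k highest-scoring tokens for downstream refinement."""
    idx = svd_contribution_scores(tokens).topk(k).indices
    return tokens[idx]
```

Likewise, a minimal sketch of per-sample online hard negative mining: each anchor keeps its own small pool of the hardest negatives seen so far, refreshed at every step. Since the abstract does not define "dual-consistency difficulty tracking", difficulty here is approximated by cosine similarity alone; PerSampleNegativePool and its update rule are illustrative.

```python
import torch

class PerSampleNegativePool:
    """Each sample id keeps its own pool of hard negatives (illustrative only)."""

    def __init__(self, pool_size: int = 8):
        self.pool_size = pool_size
        self.pools: dict[int, torch.Tensor] = {}  # id -> (pool_size, dim)

    @torch.no_grad()
    def update(self, ids, anchors, candidates, pos_idx):
        """anchors: (B, dim) image embeddings; candidates: (N, dim) text
        embeddings; pos_idx[i] is the index of anchor i's positive caption.
        Assumes all embeddings are L2-normalized, so the dot product is
        cosine similarity."""
        sims = anchors @ candidates.T                   # (B, N) similarities
        sims[torch.arange(len(ids)), pos_idx] = -1.0    # mask out positives
        hard_idx = sims.topk(self.pool_size, dim=1).indices
        for row, sid in enumerate(ids):
            self.pools[int(sid)] = candidates[hard_idx[row]]
```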
Published
2026-03-14
How to Cite
Pan, R., Song, J., & Yang, H. (2026). OneLIP: Unlocking and Improving Long-Text Representations of CLIP via One-Stage Adaptation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(10), 8251–8259. https://doi.org/10.1609/aaai.v40i10.37773
Issue
Vol. 40 No. 10 (2026)
Section
AAAI Technical Track on Computer Vision VII