CLIM: Contrastive Language-Image Mosaic for Region Representation

Authors

  • Size Wu, Nanyang Technological University
  • Wenwei Zhang, Nanyang Technological University
  • Lumin Xu, The Chinese University of Hong Kong
  • Sheng Jin, The University of Hong Kong; SenseTime Research and Tetras.AI
  • Wentao Liu, SenseTime Research and Tetras.AI; Shanghai AI Laboratory
  • Chen Change Loy, Nanyang Technological University

DOI:

https://doi.org/10.1609/aaai.v38i6.28428

Keywords:

CV: Object Detection & Categorization, CV: Language and Vision

Abstract

Detecting objects accurately from a large or open vocabulary requires vision-language alignment of region representations. However, learning such region-text alignment from high-quality box annotations with text labels or descriptions is expensive and infeasible. In contrast, image-text pairs are simpler to collect but lack the precise object location information needed to associate regions with texts. In this paper, we propose a novel approach called Contrastive Language-Image Mosaic (CLIM), which leverages large-scale image-text pairs effectively for aligning region and text representations. CLIM combines multiple images into a mosaicked image and treats each image as a ‘pseudo region’. The feature of each pseudo region is extracted and trained with a contrastive loss to be similar to the corresponding text embedding and dissimilar from the others, enabling the model to learn region-text alignment without costly box annotations. As a generally applicable approach, CLIM consistently improves different open-vocabulary object detection methods that use caption supervision. Furthermore, CLIM can effectively enhance the region representation of vision-language models, thus providing stronger backbones for open-vocabulary object detectors. Our experimental results demonstrate that CLIM improves different baseline open-vocabulary object detectors by a large margin on both the OV-COCO and OV-LVIS benchmarks. The code is available at https://github.com/wusize/CLIM.
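
The mosaic idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked above for that); it only shows the shape of the computation under several assumptions: the helper names (make_mosaic, pool_regions, contrastive_loss) are hypothetical, the mosaic itself stands in for an encoder's feature map, mean pooling stands in for a region feature extractor such as RoIAlign, and the symmetric InfoNCE form with a temperature of 0.07 is an assumed choice of contrastive loss.

```python
import numpy as np

def make_mosaic(images):
    """Tile n*n equally sized HxWxC arrays into one mosaic.

    Returns the mosaic and the (x0, y0, x1, y1) box of each source image,
    which plays the role of a 'pseudo region'.
    """
    n = int(round(len(images) ** 0.5))
    assert n * n == len(images), "expects a square number of images"
    h, w, _ = images[0].shape
    rows = [np.concatenate(images[r * n:(r + 1) * n], axis=1) for r in range(n)]
    mosaic = np.concatenate(rows, axis=0)
    boxes = [(c * w, r * h, (c + 1) * w, (r + 1) * h)
             for r in range(n) for c in range(n)]
    return mosaic, boxes

def pool_regions(feature_map, boxes):
    """Mean-pool features inside each box (assumed stand-in for RoIAlign)."""
    dim = feature_map.shape[-1]
    return np.stack([feature_map[y0:y1, x0:x1].reshape(-1, dim).mean(axis=0)
                     for x0, y0, x1, y1 in boxes])

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(region_feats, text_embeds, temperature=0.07):
    """Symmetric InfoNCE: region i should match text i and no other text."""
    r = l2_normalize(region_feats)
    t = l2_normalize(text_embeds)
    logits = r @ t.T / temperature
    idx = np.arange(len(r))
    loss_r2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2r = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_r2t + loss_t2r)

# Toy data: four "images" whose 16-dim features are biased toward channel i.
rng = np.random.default_rng(0)
images = [rng.normal(size=(8, 8, 16)) + 3.0 * np.eye(16)[i] for i in range(4)]

mosaic, boxes = make_mosaic(images)        # 2x2 mosaic, four pseudo regions
region_feats = pool_regions(mosaic, boxes)

matched_text = np.eye(16)[:4]              # text i aligned with image i
shuffled_text = matched_text[[1, 0, 3, 2]] # deliberately mismatched pairing

loss_matched = contrastive_loss(region_feats, matched_text)
loss_shuffled = contrastive_loss(region_feats, shuffled_text)
```

In this toy setup the loss is lower when each pseudo region is paired with its own text embedding than when the pairing is shuffled, which is the signal a model would be trained on. No box annotations are needed, because the pseudo-region boxes follow directly from how the mosaic was assembled.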

Published

2024-03-24

How to Cite

Wu, S., Zhang, W., Xu, L., Jin, S., Liu, W., & Loy, C. C. (2024). CLIM: Contrastive Language-Image Mosaic for Region Representation. Proceedings of the AAAI Conference on Artificial Intelligence, 38(6), 6117-6125. https://doi.org/10.1609/aaai.v38i6.28428

Section

AAAI Technical Track on Computer Vision V