TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-training

Authors

  • Chaoya Jiang, Peking University
  • Wei Ye, Peking University
  • Haiyang Xu, Alibaba Group
  • Qinghao Ye, Alibaba Group
  • Ming Yan, Alibaba Group
  • Ji Zhang, Alibaba Group
  • Shikun Zhang, Peking University

DOI:

https://doi.org/10.1609/aaai.v38i3.28025

Keywords:

CV: Language and Vision, CV: Multi-modal Vision, ML: Multimodal Learning, ML: Unsupervised & Self-Supervised Learning

Abstract

Self-supervised Multi-modal Contrastive Learning (SMCL) has remarkably advanced modern Vision-Language Pre-training (VLP) models by aligning the visual and linguistic modalities. However, because web-harvested text-image pairs are noisy, scaling up the training data volume in SMCL presents considerable obstacles in terms of computational cost and data inefficiency. To improve data efficiency in VLP, we propose Text-aware Image Mixing (TiMix), which integrates mix-based data augmentation techniques into SMCL, yielding significant performance improvements without significantly increasing computational overhead. We provide a theoretical analysis of TiMix from a mutual information (MI) perspective, showing that mixed data samples for cross-modal contrastive learning implicitly serve as a regularizer for the contrastive loss. Experimental results demonstrate that TiMix achieves comparable performance on downstream tasks when benchmarked against existing methods, even with less training data and shorter training time. This work empirically and theoretically demonstrates the potential of data mixing for data-efficient and computationally viable VLP, benefiting broader adoption of VLP models in practical scenarios. Our code is available at https://github.com/chaoyajiang/TiMiX/tree/main.
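
To make the idea of combining mix-based augmentation with a contrastive loss concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: it assumes a generic CutMix-style rectangular paste with area-based soft labels, and it does not reproduce the text-aware region selection that gives TiMix its name. The function names `cutmix` and `soft_target_contrastive_loss`, the temperature value, and the toy inputs are all hypothetical and chosen only for illustration.

```python
import math


def cutmix(img_a, img_b, top, left, height, width):
    """Paste a rectangular patch of img_b into a copy of img_a.

    Images are 2-D lists of equal size. Returns the mixed image and lam,
    the fraction of pixels that still come from img_a.
    """
    rows, cols = len(img_a), len(img_a[0])
    mixed = [row[:] for row in img_a]
    replaced = 0
    # Clip the box to the image so lam reflects the area actually replaced.
    for r in range(max(top, 0), min(top + height, rows)):
        for c in range(max(left, 0), min(left + width, cols)):
            mixed[r][c] = img_b[r][c]
            replaced += 1
    lam = 1.0 - replaced / (rows * cols)
    return mixed, lam


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def soft_target_contrastive_loss(image_embs, text_embs, targets, temperature=0.07):
    """Image-to-text cross-entropy with soft targets over in-batch texts.

    targets[i][j] is the weight of text j for image i; each row sums to 1.
    """
    image_embs = [_normalize(v) for v in image_embs]
    text_embs = [_normalize(v) for v in text_embs]
    total = 0.0
    for img, target_row in zip(image_embs, targets):
        logits = [sum(a * b for a, b in zip(img, txt)) / temperature
                  for txt in text_embs]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total -= sum(t * (l - log_z) for t, l in zip(target_row, logits))
    return total / len(image_embs)


# Toy example: a 2x2 patch of img_b replaces the top-left corner of img_a.
img_a = [[0.0] * 4 for _ in range(4)]
img_b = [[1.0] * 4 for _ in range(4)]
mixed, lam = cutmix(img_a, img_b, top=0, left=0, height=2, width=2)
# lam == 0.75, so the mixed image is paired with text A at weight 0.75
# and with text B at weight 0.25.
soft_targets = [[lam, 1.0 - lam], [0.0, 1.0]]
```

In this sketch the mixed image keeps a soft association with both captions, weighted by how much of each source image survives. A text-aware variant would replace the fixed box and the area-based weights with choices guided by the caption.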

Published

2024-03-24

How to Cite

Jiang, C., Ye, W., Xu, H., Ye, Q., Yan, M., Zhang, J., & Zhang, S. (2024). TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-training. Proceedings of the AAAI Conference on Artificial Intelligence, 38(3), 2489-2497. https://doi.org/10.1609/aaai.v38i3.28025

Section

AAAI Technical Track on Computer Vision II