Video-Text Pre-training with Learned Regions for Retrieval

Authors

  • Rui Yan, Nanjing University of Science and Technology
  • Mike Zheng Shou, National University of Singapore
  • Yixiao Ge, Tencent PCG
  • Jinpeng Wang, National University of Singapore
  • Xudong Lin, Columbia University
  • Guanyu Cai, Tongji University
  • Jinhui Tang, Nanjing University of Science and Technology

DOI:

https://doi.org/10.1609/aaai.v37i3.25414

Keywords:

CV: Image and Video Retrieval, CV: Language and Vision, CV: Multi-modal Vision, CV: Video Understanding & Activity Analysis

Abstract

Video-text pre-training aims to learn transferable representations from large-scale video-text pairs by aligning the semantics of visual and textual information. State-of-the-art approaches extract visual features from raw pixels in an end-to-end fashion. However, these methods operate directly at the frame level and thus overlook the spatio-temporal structure of objects in video, which has a strong synergy with nouns in textual descriptions. In this work, we propose a simple yet effective module for video-text representation learning, namely RegionLearner, which takes the structure of objects into account during pre-training on large-scale video-text pairs. Given a video, our module (1) first quantizes continuous visual features, clustering patch features with similar content into the same cluster, then (2) generates learnable masks to aggregate these fragmentary features into regions with complete semantics, and finally (3) models the spatio-temporal dependencies between the different semantic regions. In contrast to using off-the-shelf object detectors, our module requires no explicit supervision and is much more computationally efficient. We pre-train the proposed approach on the public WebVid2M and CC3M datasets. Extensive evaluations on four downstream video-text retrieval benchmarks clearly demonstrate the effectiveness of our RegionLearner.
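
To make the three-step pipeline concrete, below is a minimal PyTorch sketch of the idea, not the authors' released implementation; the class and parameter names (RegionLearnerSketch, num_codes, num_regions) are illustrative assumptions. It quantizes patch features against a learnable codebook with a straight-through estimator, soft-assigns them to regions via learned masks, and runs attention over the resulting region tokens.

```python
# A minimal sketch of the quantize -> mask-aggregate -> attend pipeline
# described in the abstract; all names and hyperparameters are assumptions.
import torch
import torch.nn as nn

class RegionLearnerSketch(nn.Module):
    def __init__(self, dim=768, num_codes=512, num_regions=8, num_heads=8):
        super().__init__()
        # (1) learnable codebook for quantizing continuous patch features
        self.codebook = nn.Embedding(num_codes, dim)
        # (2) learnable soft masks assigning patches to semantic regions
        self.to_masks = nn.Linear(dim, num_regions)
        # (3) attention modeling dependencies between region tokens
        self.region_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patches):
        # patches: (B, T, N, D) frame-level patch features from a video encoder
        B, T, N, D = patches.shape
        x = patches.reshape(B, T * N, D)

        # (1) quantize: snap each patch to its nearest codebook entry, so
        # patches with similar content fall into the same cluster
        dists = torch.cdist(x, self.codebook.weight.unsqueeze(0).expand(B, -1, -1))
        q = self.codebook(dists.argmin(dim=-1))              # (B, T*N, D)
        x = x + (q - x).detach()  # straight-through estimator keeps gradients

        # (2) aggregate fragmentary patch features into regions via soft masks
        masks = self.to_masks(x).softmax(dim=-1).reshape(B, T, N, -1)
        regions = torch.einsum('btnr,btnd->btrd', masks, x.reshape(B, T, N, D))

        # (3) model spatio-temporal dependencies across all region tokens
        r = regions.reshape(B, -1, D)                        # (B, T*R, D)
        r, _ = self.region_attn(r, r, r)
        return r  # region tokens to be aligned with text embeddings

# Example: 2 clips, 4 frames, 7x7 patches, 768-d features -> (2, 32, 768)
tokens = RegionLearnerSketch()(torch.randn(2, 4, 49, 768))
```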

Published

2023-06-26

How to Cite

Yan, R., Shou, M. Z., Ge, Y., Wang, J., Lin, X., Cai, G., & Tang, J. (2023). Video-Text Pre-training with Learned Regions for Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 37(3), 3100-3108. https://doi.org/10.1609/aaai.v37i3.25414

Section

AAAI Technical Track on Computer Vision III