Learning Semantic Alignment with Global Modality Reconstruction for Video-Language Pre-training towards Retrieval

Authors

  • Mingchao Li, Tsinghua University; Alibaba Group
  • Xiaoming Shi, Shanghai Artificial Intelligence Laboratory
  • Haitao Leng, Alibaba Group
  • Wei Zhou, Alibaba Group
  • Hai-Tao Zheng, Tsinghua University; Peng Cheng Laboratory
  • Kuncai Zhang, Alibaba Group

DOI:

https://doi.org/10.1609/aaai.v37i1.25222

Keywords:

CV: Image and Video Retrieval, CV: Multi-modal Vision

Abstract

Video-language pre-training for text-based video retrieval tasks is vitally important. Previous pre-training methods suffer from semantic misalignments because they focus on aligning critical tokens while ignoring alignment at the sequence level. To alleviate this problem, we propose a video-language pre-training framework, termed video-language pre-training For lEarning sEmantic aLignments (FEEL), to learn semantic alignments at the sequence level. Specifically, global modality reconstruction and a cross-modal self-contrasting method are utilized to better learn alignments at the sequence level. Extensive experimental results demonstrate the effectiveness of FEEL on text-based video retrieval and text-based video corpus moment retrieval.
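The abstract names two sequence-level objectives, global modality reconstruction and cross-modal self-contrasting, without spelling out what "alignment at the sequence level" looks like. As a rough illustration only, the sketch below implements a generic symmetric InfoNCE loss over pooled (global) video and text embeddings, so whole sequences rather than individual tokens are matched. This is a minimal stand-in under stated assumptions, not the FEEL implementation; all names (video_emb, text_emb, temperature) are hypothetical.

import torch
import torch.nn.functional as F

def sequence_level_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (batch, dim) global sequence representations,
    # e.g. mean-pooled frame/token features, so that whole sequences
    # (not individual tokens) are pulled into alignment.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                 # (batch, batch) similarities
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)    # video-to-text direction
    loss_t2v = F.cross_entropy(logits.T, targets)  # text-to-video direction
    return 0.5 * (loss_v2t + loss_t2v)

# Toy usage with random features standing in for encoder outputs.
loss = sequence_level_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())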

Published

2023-06-26

How to Cite

Li, M., Shi, X., Leng, H., Zhou, W., Zheng, H.-T., & Zhang, K. (2023). Learning Semantic Alignment with Global Modality Reconstruction for Video-Language Pre-training towards Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 37(1), 1377-1385. https://doi.org/10.1609/aaai.v37i1.25222

Issue

Vol. 37 No. 1 (2023)

Section

AAAI Technical Track on Computer Vision I