Learning Semantic Alignment with Global Modality Reconstruction for Video-Language Pre-training towards Retrieval
DOI:
https://doi.org/10.1609/aaai.v37i1.25222
Keywords:
CV: Image and Video Retrieval, CV: Multi-modal Vision
Abstract
Video-language pre-training for text-based video retrieval tasks is vitally important. Previous pre-training methods suffer from semantic misalignment because they focus on aligning critical tokens while ignoring alignment at the sequence level. To alleviate this problem, we propose a video-language pre-training framework, termed video-language pre-training For lEarning sEmantic aLignments (FEEL), to learn semantic alignments at the sequence level. Specifically, global modality reconstruction and a cross-modal self-contrasting method are utilized to better learn alignments at the sequence level. Extensive experimental results demonstrate the effectiveness of FEEL on text-based video retrieval and text-based video corpus moment retrieval.
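The abstract does not spell out the two objectives it names, so the following is only a rough, minimal sketch of what sequence-level contrastive alignment combined with a masked reconstruction term could look like in PyTorch. Every function name, the InfoNCE formulation, and the MSE reconstruction head here are assumptions for illustration; they are not the authors' FEEL implementation.

```python
import torch
import torch.nn.functional as F

def sequence_level_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (batch, dim) pooled sequence-level embeddings.
    # Symmetric InfoNCE over the batch; matched video-text pairs share an index.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature  # (batch, batch) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

def global_reconstruction_loss(decoder, fused_seq, target_seq, mask):
    # Reconstruct masked positions of one modality's sequence from a
    # cross-modal fused representation. This MSE-over-masked-positions form
    # is an assumed stand-in for "global modality reconstruction".
    pred = decoder(fused_seq)  # (batch, seq_len, dim)
    diff = (pred - target_seq) ** 2
    return (diff.mean(-1) * mask).sum() / mask.sum().clamp(min=1)

if __name__ == "__main__":
    B, L, D = 4, 16, 256
    video_emb, text_emb = torch.randn(B, D), torch.randn(B, D)
    decoder = torch.nn.Linear(D, D)           # hypothetical reconstruction head
    fused, target = torch.randn(B, L, D), torch.randn(B, L, D)
    mask = (torch.rand(B, L) < 0.15).float()  # 15% of positions masked
    loss = sequence_level_contrastive_loss(video_emb, text_emb) \
         + global_reconstruction_loss(decoder, fused, target, mask)
    print(loss.item())
```

In this sketch the two losses are simply summed; how the paper actually weights or schedules its objectives, and how the cross-modal self-contrasting term differs from plain InfoNCE, is described in the full text, not here.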
Published
2023-06-26
How to Cite
Li, M., Shi, X., Leng, H., Zhou, W., Zheng, H.-T., & Zhang, K. (2023). Learning Semantic Alignment with Global Modality Reconstruction for Video-Language Pre-training towards Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 37(1), 1377-1385. https://doi.org/10.1609/aaai.v37i1.25222
Issue
Vol. 37 No. 1 (2023)
Section
AAAI Technical Track on Computer Vision I