Fewer Steps, Better Performance: Efficient Cross-Modal Clip Trimming for Video Moment Retrieval Using Language

Authors

  • Xiang Fang, Huazhong University of Science and Technology
  • Daizong Liu, Peking University
  • Wanlong Fang, Henan University; Huazhong University of Science and Technology
  • Pan Zhou, Huazhong University of Science and Technology
  • Zichuan Xu, Dalian University of Technology
  • Wenzheng Xu, Sichuan University
  • Junyang Chen, Shenzhen University
  • Renfu Li, Huazhong University of Science and Technology

DOI:

https://doi.org/10.1609/aaai.v38i2.27941

Keywords:

CV: Language and Vision, NLP: Language Grounding & Multi-modal NLP

Abstract

Given an untrimmed video and a sentence query, video moment retrieval using language (VMR) aims to locate the query-relevant moment in the video. Because untrimmed videos can be very long, almost all existing VMR methods first sparsely down-sample each video into multiple fixed-length clips and then perform multi-modal reasoning over the query feature and expensive clip features, which is infeasible for real-world videos that span hours. Moreover, because the video is down-sampled into fixed-length clips, some query-related frames may be filtered out; this blurs the boundaries of the target moment and causes adjacent irrelevant frames to be taken as new boundaries, leading to cross-modal misalignment and introducing both boundary bias and reasoning bias. To address these issues, we propose an efficient approach, SpotVMR, that trims the query-relevant clip. SpotVMR can also serve as a plug-and-play module, improving the efficiency of state-of-the-art VMR methods while maintaining strong retrieval performance. Specifically, we first design a novel clip search model that learns to identify promising video regions to search, conditioned on the language query. We then introduce a set of low-cost semantic indexing features that capture the context of objects and interactions, suggesting where to search for the query-relevant moment. In addition, a distillation loss addresses the optimization issues that arise from end-to-end joint training of the clip selector and the VMR model. Extensive experiments on three challenging datasets demonstrate the effectiveness of our approach.
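To make the abstract's two key ideas concrete, the following PyTorch-style sketch illustrates how a lightweight query-conditioned clip selector and a distillation loss might be wired together. This is an illustration only, not the authors' implementation: all module names, feature dimensions, and the KL-based distillation formulation are assumptions.

```python
# Illustrative sketch (assumed, not the paper's code) of a query-conditioned
# clip selector plus a distillation loss from a stronger VMR "teacher".
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClipSelector(nn.Module):
    """Scores cheap clip-level features against a sentence-query feature."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj_v = nn.Linear(dim, dim)  # low-cost "semantic indexing" clip features
        self.proj_q = nn.Linear(dim, dim)  # sentence-query feature

    def forward(self, clip_feats, query_feat):
        # clip_feats: (B, N, D) features for N candidate clips
        # query_feat: (B, D) sentence embedding
        v = self.proj_v(clip_feats)                    # (B, N, D)
        q = self.proj_q(query_feat).unsqueeze(1)       # (B, 1, D)
        scores = (v * q).sum(-1) / v.size(-1) ** 0.5   # (B, N) relevance logits
        return scores

def distillation_loss(selector_logits, teacher_scores, tau=2.0):
    """KL distillation from the full VMR model's clip-level relevance to the
    lightweight selector; a common formulation, assumed here."""
    p_teacher = F.softmax(teacher_scores / tau, dim=-1)
    log_p_student = F.log_softmax(selector_logits / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau ** 2
```

At inference, such a selector would score all candidate clips cheaply, keep only the top-ranked regions, and pass those alone to the expensive VMR model, which is what yields the efficiency gain the abstract claims for long videos.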

Published

2024-03-24

How to Cite

Fang, X., Liu, D., Fang, W., Zhou, P., Xu, Z., Xu, W., Chen, J., & Li, R. (2024). Fewer Steps, Better Performance: Efficient Cross-Modal Clip Trimming for Video Moment Retrieval Using Language. Proceedings of the AAAI Conference on Artificial Intelligence, 38(2), 1735-1743. https://doi.org/10.1609/aaai.v38i2.27941

Section

AAAI Technical Track on Computer Vision I