Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models

Authors

  • Yifang Xu Nanjing University
  • Yunzhuo Sun Dalian University of Technology
  • Benxiang Zhai Nanjing University
  • Ming Li Nanjing University
  • Wenxin Liang Dalian University of Technology
  • Yang Li Nanjing University
  • Sidan Du Nanjing University

DOI:

https://doi.org/10.1609/aaai.v39i9.32971

Abstract

Video moment retrieval (VMR) aims to predict the temporal spans within a video that semantically match a given linguistic query. Existing VMR methods based on multimodal large language models (MLLMs) rely heavily on expensive, high-quality datasets and time-consuming fine-tuning. Although some recent studies introduce a zero-shot setting to avoid fine-tuning, they overlook the inherent language bias in the query, leading to erroneous localization. To tackle these challenges, this paper proposes Moment-GPT, a tuning-free pipeline for zero-shot VMR that utilizes frozen MLLMs. Specifically, we first employ LLaMA-3 to correct and rephrase the query to mitigate language bias. Subsequently, we design a span generator combined with MiniGPT-v2 to produce candidate spans adaptively. Finally, to leverage the video-comprehension capabilities of MLLMs, we apply Video-ChatGPT and a span scorer to select the most suitable spans. Our proposed method substantially outperforms state-of-the-art MLLM-based and zero-shot models on several public datasets, including QVHighlights, ActivityNet-Captions, and Charades-STA.
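The three-stage pipeline described above (query rephrasing, candidate span generation, span scoring) can be sketched in outline. This is a toy illustration only: every function body below is a stub standing in for the frozen MLLMs named in the abstract (LLaMA-3, MiniGPT-v2, Video-ChatGPT), and the sliding-window proposals and scoring rule are placeholder assumptions, not the authors' method.

```python
# Toy sketch of a Moment-GPT-style zero-shot VMR pipeline.
# All logic here is illustrative; the real system delegates each
# stage to a frozen multimodal LLM.

def rephrase_query(query: str) -> str:
    # Stage 1: an LLM (LLaMA-3 in the paper) corrects and rephrases
    # the query to mitigate language bias. Stubbed as a no-op here.
    return query.strip()

def generate_candidate_spans(video_len: float, stride: float = 5.0):
    # Stage 2: the span generator (paired with MiniGPT-v2 in the paper)
    # proposes candidate (start, end) spans. Stubbed as fixed-stride
    # sliding windows over the video timeline.
    spans, start = [], 0.0
    while start < video_len:
        spans.append((start, min(start + stride, video_len)))
        start += stride
    return spans

def score_span(span, query: str) -> float:
    # Stage 3: a video MLLM (Video-ChatGPT in the paper) plus a span
    # scorer rate query-span relevance. Stubbed with a toy score that
    # happens to prefer spans near t = 12.5 s, purely for illustration.
    mid = (span[0] + span[1]) / 2.0
    return -abs(mid - 12.5)

def retrieve_moment(video_len: float, query: str):
    # Chain the three stages and return the highest-scoring span.
    query = rephrase_query(query)
    spans = generate_candidate_spans(video_len)
    return max(spans, key=lambda s: score_span(s, query))

best = retrieve_moment(30.0, "person opens the door")
print(best)  # the candidate span with the highest stub score
```

The key property the sketch captures is that every component is frozen and composed at inference time, so no stage requires fine-tuning or annotated training data.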

Published

2025-04-11

How to Cite

Xu, Y., Sun, Y., Zhai, B., Li, M., Liang, W., Li, Y., & Du, S. (2025). Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 39(9), 8978–8986. https://doi.org/10.1609/aaai.v39i9.32971

Section

AAAI Technical Track on Computer Vision VIII