ShotVL: Human-Centric Highlight Frame Retrieval via Language Queries

Authors

  • Wangyu Xue (Department of Computer Science and Technology, Tsinghua University)
  • Chen Qian (Department of Computer Science and Technology, Tsinghua University; SenseTime Research)
  • Jiayi Wu (SenseTime Research)
  • Yang Zhou (SenseTime Research)
  • Wentao Liu (SenseTime Research)
  • Ju Ren (Department of Computer Science and Technology, Tsinghua University)
  • Siming Fan (SenseTime Research)
  • Yaoxue Zhang (Department of Computer Science and Technology, Tsinghua University)

DOI:

https://doi.org/10.1609/aaai.v39i9.32979

Abstract

Existing research on human-centric video understanding typically focuses on analyzing specific moments or entire videos. However, many applications require higher precision at the frame level. In this work, we propose a novel task, BestShot, which aims to locate highlight frames within human-centric videos through language queries. This task requires not only a deep semantic understanding of human actions but also precise temporal localization. To support this task, we introduce the BestShot Benchmark. The benchmark is meticulously constructed by combining human-annotated highlight frames, duration labels, and detailed textual descriptions. These descriptions cover three critical elements: (1) visual content; (2) fine-grained actions; and (3) human pose descriptions. Together, these elements provide the precision necessary to identify the exact highlight frames in videos. To tackle this problem, we have collected two distinct datasets: (i) the ShotGPT4o Dataset, which is algorithmically generated by GPT-4o, and (ii) the Image-SMPLText Dataset, which features large-scale and accurate per-frame pose descriptions produced using PoseScript and existing pose estimation datasets. Based on these datasets, we present a strong baseline model, ShotVL, fine-tuned from InternVL specifically for BestShot. We highlight the impressive zero-shot capabilities of our model and offer comparative analyses with existing state-of-the-art (SOTA) models. ShotVL demonstrates a significant 64% improvement over InternVL on the BestShot Benchmark and a notable 68% improvement on the THUMOS14 Benchmark, while maintaining SOTA performance in general image classification and retrieval.

Published

2025-04-11

How to Cite

Xue, W., Qian, C., Wu, J., Zhou, Y., Liu, W., Ren, J., … Zhang, Y. (2025). ShotVL: Human-Centric Highlight Frame Retrieval via Language Queries. Proceedings of the AAAI Conference on Artificial Intelligence, 39(9), 9050–9058. https://doi.org/10.1609/aaai.v39i9.32979

Section

AAAI Technical Track on Computer Vision VIII