ShotVL: Human-Centric Highlight Frame Retrieval via Language Queries

Authors

  • Wangyu Xue (Department of Computer Science and Technology, Tsinghua University)
  • Chen Qian (Department of Computer Science and Technology, Tsinghua University; SenseTime Research)
  • Jiayi Wu (SenseTime Research)
  • Yang Zhou (SenseTime Research)
  • Wentao Liu (SenseTime Research)
  • Ju Ren (Department of Computer Science and Technology, Tsinghua University)
  • Siming Fan (SenseTime Research)
  • Yaoxue Zhang (Department of Computer Science and Technology, Tsinghua University)

DOI:

https://doi.org/10.1609/aaai.v39i9.32979

Abstract

Existing research on human-centric video understanding typically focuses on analyzing specific moments or entire videos. However, many applications require higher precision at the frame level. In this work, we propose a novel task, BestShot, which aims to locate highlight frames within human-centric videos through language queries. This task requires not only a deep semantic understanding of human actions but also precise temporal localization. To support this task, we introduce the BestShot Benchmark. The benchmark is meticulously constructed by combining human-annotated highlight frames, duration labels, and detailed textual descriptions. These descriptions cover three critical elements: (1) visual content; (2) fine-grained actions; and (3) human pose descriptions. Together, these elements provide the precision necessary to identify the exact highlight frames in videos. To tackle this problem, we have collected two distinct datasets: (i) the ShotGPT4o Dataset, which is algorithmically generated by GPT-4o, and (ii) the Image-SMPLText Dataset, which features large-scale and accurate per-frame pose descriptions produced using PoseScript and existing pose estimation datasets. Based on these datasets, we present a strong baseline model, ShotVL, fine-tuned from InternVL specifically for BestShot. We highlight the impressive zero-shot capabilities of our model and offer comparative analyses with existing state-of-the-art (SOTA) models. ShotVL demonstrates a significant 64% improvement over InternVL on the BestShot Benchmark and a notable 68% improvement on the THUMOS14 Benchmark, while maintaining SOTA performance in general image classification and retrieval.

Published

2025-04-11

How to Cite

Xue, W., Qian, C., Wu, J., Zhou, Y., Liu, W., Ren, J., … Zhang, Y. (2025). ShotVL: Human-Centric Highlight Frame Retrieval via Language Queries. Proceedings of the AAAI Conference on Artificial Intelligence, 39(9), 9050–9058. https://doi.org/10.1609/aaai.v39i9.32979

Section

AAAI Technical Track on Computer Vision VIII