IPFormer: Instance Prompt-guided Transformer for Multi-modal Multi-shot Video Understanding
DOI:
https://doi.org/10.1609/aaai.v40i9.37624
Abstract
Video Large Language Models (VideoLLMs), which adapt large language models for video understanding, have proven effective on single-shot videos. However, they usually struggle with multi-shot videos involving frequent shot changes, varying camera angles, etc., which makes it hard for VideoLLMs to answer questions about multiple instances or shots across the whole video. We attribute this challenge to two issues: 1) the lack of multi-shot, multi-instance annotations in existing datasets, and 2) the neglect of instance-aware modeling in current VideoLLMs. Therefore, we first introduce a new dataset, termed MultiClip-Bench, featuring dense descriptions and question-answering pairs tailored for multi-shot and multi-instance scenarios. Moreover, since existing VideoLLMs neglect explicit modeling of instance-related features, we propose a novel Instance Prompt-guided Transformer, named IPFormer, to achieve instance-aware video understanding. In IPFormer, we design a simple but effective instance-aware feature injection module, which encodes instance features as instance prompts via an attention-based connector. By this means, IPFormer can aggregate instance-specific information across multiple shots. Extensive experiments show not only that our dataset and model significantly improve multi-shot video understanding, but also that MultiClip-Bench provides valuable training data and benchmarks for various video understanding tasks.
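The core idea of the connector described in the abstract, using instance features as queries ("prompts") that attend over video tokens from all shots, can be illustrated with a toy cross-attention sketch. This is a minimal illustration under our own assumptions, not the paper's actual implementation; the function name, shapes, and single-head formulation here are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def instance_prompt_attention(instance_feats, video_tokens):
    """Toy single-head cross-attention (hypothetical, not IPFormer itself).

    Instance features act as queries over video tokens pooled from all
    shots, so each prompt gathers the evidence for "its" instance even
    when that instance is scattered across shots.
    instance_feats: (num_instances, d); video_tokens: (num_tokens, d).
    Returns one aggregated context vector per instance: (num_instances, d).
    """
    d = instance_feats.shape[-1]
    scores = instance_feats @ video_tokens.T / np.sqrt(d)  # (I, T)
    attn = softmax(scores, axis=-1)                        # rows sum to 1
    return attn @ video_tokens                             # (I, d)

# toy usage: 2 instances, 6 video tokens from several shots, dim 8
rng = np.random.default_rng(0)
out = instance_prompt_attention(rng.normal(size=(2, 8)),
                                rng.normal(size=(6, 8)))
print(out.shape)  # (2, 8)
```

In the paper's framing, the resulting per-instance vectors would be injected into the LLM input alongside ordinary video tokens; the sketch above only shows the aggregation step.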
Published
2026-03-14
How to Cite
Liang, Y., Jiao, J., Feng, X., Liu, X., Liu, K., Wang, Y., … Wang, Z. (2026). IPFormer: Instance Prompt-guided Transformer for Multi-modal Multi-shot Video Understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 40(9), 6907–6915. https://doi.org/10.1609/aaai.v40i9.37624
Issue
Section
AAAI Technical Track on Computer Vision VI