IPFormer: Instance Prompt-guided Transformer for Multi-modal Multi-shot Video Understanding
DOI:
https://doi.org/10.1609/aaai.v40i9.37624
Abstract
Video Large Language Models (VideoLLMs), which adapt large language models for video understanding, have proven effective on single-shot videos. However, they usually struggle with multi-shot videos involving frequent shot changes, varying camera angles, etc., which makes it hard for VideoLLMs to answer questions about multiple instances or shots across the whole video. We attribute this challenge to two issues: 1) the lack of multi-shot, multi-instance annotations in existing datasets, and 2) the neglect of instance-aware modeling in current VideoLLMs. Therefore, we first introduce a new dataset, termed MultiClip-Bench, featuring dense descriptions and question-answering pairs tailored for multi-shot and multi-instance scenarios. Moreover, since existing VideoLLMs neglect explicit modeling of instance-related features, we propose a novel Instance Prompt-guided Transformer, named IPFormer, to achieve instance-aware video understanding. In IPFormer, we design a simple but effective instance-aware feature injection module, which encodes instance features as instance prompts via an attention-based connector. By this means, IPFormer can aggregate instance-specific information across multiple shots. Extensive experiments show not only that our dataset and model significantly improve multi-shot video understanding, but also that MultiClip-Bench provides valuable training data and benchmarks for various video understanding tasks.
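The core idea of the connector described in the abstract, using instance features as queries ("prompts") that attend over video tokens from all shots, can be illustrated with a toy cross-attention sketch. This is a minimal illustration under our own assumptions, not the paper's actual implementation; the function name, shapes, and single-head formulation here are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def instance_prompt_attention(instance_feats, video_tokens):
    """Toy single-head cross-attention (hypothetical, not IPFormer itself).

    Instance features act as queries over video tokens pooled from all
    shots, so each prompt gathers the evidence for "its" instance even
    when that instance is scattered across shots.
    instance_feats: (num_instances, d); video_tokens: (num_tokens, d).
    Returns one aggregated context vector per instance: (num_instances, d).
    """
    d = instance_feats.shape[-1]
    scores = instance_feats @ video_tokens.T / np.sqrt(d)  # (I, T)
    attn = softmax(scores, axis=-1)                        # rows sum to 1
    return attn @ video_tokens                             # (I, d)

# toy usage: 2 instances, 6 video tokens from several shots, dim 8
rng = np.random.default_rng(0)
out = instance_prompt_attention(rng.normal(size=(2, 8)),
                                rng.normal(size=(6, 8)))
print(out.shape)  # (2, 8)
```

In the paper's framing, the resulting per-instance vectors would be injected into the LLM input alongside ordinary video tokens; the sketch above only shows the aggregation step.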
Published
2026-03-14
How to Cite
Liang, Y., Jiao, J., Feng, X., Liu, X., Liu, K., Wang, Y., … Wang, Z. (2026). IPFormer: Instance Prompt-guided Transformer for Multi-modal Multi-shot Video Understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 40(9), 6907–6915. https://doi.org/10.1609/aaai.v40i9.37624
Issue
Section
AAAI Technical Track on Computer Vision VI