IPFormer: Instance Prompt-guided Transformer for Multi-modal Multi-shot Video Understanding

Authors

  • Yujia Liang JD Explore Academy
  • Jile Jiao Deepeleph Intelligent Technology
  • Xuetao Feng Deepeleph Intelligent Technology
  • Xinchen Liu JD Explore Academy
  • Kun Liu JD Explore Academy
  • Yuan Wang Deepeleph Intelligent Technology
  • Zixuan Ye School of AIA, Huazhong University of Science and Technology
  • Hao Lu School of AIA, Huazhong University of Science and Technology
  • Zhicheng Wang School of AIA, Huazhong University of Science and Technology

DOI:

https://doi.org/10.1609/aaai.v40i9.37624

Abstract

Video Large Language Models (VideoLLMs), which adapt large language models to video understanding, have proven effective on single-shot videos. However, they typically struggle with multi-shot videos that contain frequent shot changes, varying camera angles, etc., making it hard for VideoLLMs to answer questions about multiple instances or shots across the whole video. We attribute this challenge to two issues: 1) the lack of multi-shot, multi-instance annotations in existing datasets, and 2) the neglect of instance-aware modeling in current VideoLLMs. Therefore, we first introduce a new dataset, termed MultiClip-Bench, featuring dense descriptions and question-answering pairs tailored to multi-shot, multi-instance scenarios. Moreover, since existing VideoLLMs neglect explicit modeling of instance-related features, we propose a novel Instance Prompt-guided Transformer, named IPFormer, to achieve instance-aware video understanding. In IPFormer, we design a simple but effective instance-aware feature injection module, which encodes instance features as instance prompts via an attention-based connector. By this means, IPFormer can aggregate instance-specific information across multiple shots. Extensive experiments show not only that our dataset and model significantly improve multi-shot video understanding, but also that MultiClip-Bench provides valuable training data and benchmarks for various video understanding tasks.
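The attention-based connector described above can be illustrated with a minimal, self-contained sketch. This is a hypothetical reconstruction, not the authors' released code: instance features act as queries that attend over the video's frame tokens, and the resulting instance prompts are prepended to the visual token sequence. All projection matrices are random stand-ins for learned weights, and the function name `instance_prompt_connector` is an assumption for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def instance_prompt_connector(video_tokens, instance_feats, seed=0):
    """Hypothetical sketch of an attention-based connector:
    instance features query the frame tokens across all shots,
    producing instance prompts that are prepended to the tokens."""
    rng = np.random.default_rng(seed)
    d = video_tokens.shape[-1]
    # Random projections stand in for learned Q/K/V weights.
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)
    Q = instance_feats @ Wq                 # (num_instances, d)
    K = video_tokens @ Wk                   # (num_tokens, d)
    V = video_tokens @ Wv                   # (num_tokens, d)
    # Each instance aggregates its evidence from every shot's tokens.
    attn = softmax(Q @ K.T / np.sqrt(d))    # (num_instances, num_tokens)
    prompts = attn @ V                      # (num_instances, d)
    # Instance prompts are injected ahead of the visual tokens.
    return np.concatenate([prompts, video_tokens], axis=0)

tokens = np.random.default_rng(1).standard_normal((16, 32))  # 16 frame tokens
insts = np.random.default_rng(2).standard_normal((3, 32))    # 3 instance features
out = instance_prompt_connector(tokens, insts)
print(out.shape)  # (19, 32): 3 instance prompts + 16 frame tokens
```

Because the prompts sit in the same token sequence consumed by the language model, instance-specific information gathered across shots is available at every decoding step.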

Published

2026-03-14

How to Cite

Liang, Y., Jiao, J., Feng, X., Liu, X., Liu, K., Wang, Y., … Wang, Z. (2026). IPFormer: Instance Prompt-guided Transformer for Multi-modal Multi-shot Video Understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 40(9), 6907–6915. https://doi.org/10.1609/aaai.v40i9.37624

Section

AAAI Technical Track on Computer Vision VI