[1] Y. Liang, “IPFormer: Instance Prompt-guided Transformer for Multi-modal Multi-shot Video Understanding”, AAAI, vol. 40, no. 9, pp. 6907–6915, Mar. 2026.