Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting

Authors

  • Yolo Yunlong Tang, University of Rochester
  • Jing Bi, University of Rochester
  • Chao Huang, University of Rochester
  • Susan Liang, University of Rochester
  • Daiki Shimada, Sony Group Corporation
  • Hang Hua, University of Rochester
  • Yunzhong Xiao, Carnegie Mellon University
  • Yizhi Song, Purdue University
  • Pinxin Liu, University of Rochester
  • Mingqian Feng, University of Rochester
  • Junjia Guo, University of Rochester
  • Zhuo Liu, University of Rochester
  • Luchuan Song, University of Rochester
  • Ali Vosoughi, University of Rochester
  • Jinxi He, University of Rochester
  • Liu He, Purdue University
  • Zeliang Zhang, University of Rochester
  • Jiebo Luo, University of Rochester
  • Chenliang Xu, University of Rochester

DOI:

https://doi.org/10.1609/aaai.v40i48.42385

Abstract

We introduce CAT-V (Caption Anything in Video), a training-free framework for fine-grained, object-centric video captioning of user-selected instances. CAT-V combines (i) a SAMURAI-based Segmenter that produces precise object masks across frames, (ii) a TRACE-Uni Temporal Analyzer that detects event boundaries and generates coarse event descriptions, and (iii) an InternVL-2.5 Captioner that, conditioned on spatiotemporal visual prompts and chain-of-thought (CoT) guidance, produces detailed, temporally coherent captions covering object attributes, actions, states, interactions, and context. The system supports point, box, and region prompts and maintains temporal sensitivity by tracking object states across segments. In contrast to vanilla video captioning, which tends to be overly abstract, and dense video captioning, which is often terse, CAT-V provides object-level specificity with spatial accuracy and temporal coherence, without requiring additional training data.
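
The three-stage flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (Event, ObjectCaption, caption_object, and the three stub stages) is hypothetical, the stubs stand in for the SAMURAI-based Segmenter, the TRACE-Uni Temporal Analyzer, and the InternVL-2.5 Captioner, and the way the previous caption is passed forward is only one assumed way of tracking object state across segments.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Event:
    """A temporal segment with a coarse description (hypothetical type)."""
    start: float  # seconds
    end: float    # seconds
    coarse_caption: str


@dataclass
class ObjectCaption:
    """An object-centric caption for one segment (hypothetical type)."""
    start: float
    end: float
    text: str


def caption_object(
    frames: Sequence[str],
    fps: float,
    prompt: dict,
    segment: Callable[[Sequence[str], dict], List[str]],
    analyze: Callable[[Sequence[str]], List[Event]],
    caption: Callable[[Sequence[str], Sequence[str], str], str],
) -> List[ObjectCaption]:
    """Chain the three stages; assumes all stages are supplied by the caller."""
    masks = segment(frames, prompt)   # stage (i): one mask per frame
    events = analyze(frames)          # stage (ii): event boundaries
    results: List[ObjectCaption] = []
    previous_state = "none"
    for event in events:
        # Map the event's time span to frame indices, clamped to the clip.
        lo = min(int(event.start * fps), len(frames) - 1)
        hi = min(len(frames), max(lo + 1, int(event.end * fps)))
        # Assumed CoT-style guidance: coarse event hint plus previous state.
        guidance = (
            f"Event hint: {event.coarse_caption}. "
            f"Previous object state: {previous_state}. "
            "Describe the object's attributes, actions, and interactions."
        )
        text = caption(frames[lo:hi], masks[lo:hi], guidance)  # stage (iii)
        results.append(ObjectCaption(event.start, event.end, text))
        previous_state = text  # carry state forward to the next segment
    return results


# Stub stages, for illustration only; real models would replace these.
def stub_segment(frames: Sequence[str], prompt: dict) -> List[str]:
    return [f"mask:{f}" for f in frames]


def stub_analyze(frames: Sequence[str]) -> List[Event]:
    return [Event(0.0, 1.5, "object enters"), Event(1.5, 3.0, "object leaves")]


def stub_caption(frames: Sequence[str], masks: Sequence[str], guidance: str) -> str:
    return f"{len(frames)} frames | {guidance}"


demo = caption_object(
    ["f0", "f1", "f2", "f3", "f4", "f5"], 2.0, {"point": (10, 20)},
    stub_segment, stub_analyze, stub_caption,
)
```

In this sketch the user's prompt is consumed only by the segmentation stage, while the captioning stage sees the frames, the masks, and a text guidance string for each detected event.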

Published

2026-03-14

How to Cite

Tang, Y. Y., Bi, J., Huang, C., Liang, S., Shimada, D., Hua, H., … Xu, C. (2026). Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting. Proceedings of the AAAI Conference on Artificial Intelligence, 40(48), 41697–41699. https://doi.org/10.1609/aaai.v40i48.42385