OwlCap: Harmonizing Motion-Detail for Video Captioning via HMD-270K and Caption Set Equivalence Reward

Authors

  • Chunlin Zhong (Huazhong University of Science and Technology)
  • Qiuxia Hou (Oppo AI center)
  • Zhangjun Zhou (Huazhong University of Science and Technology)
  • Yanhao Zhang (Oppo AI center)
  • Shuang Hao (Huazhong University of Science and Technology; Xi'an Jiaotong University)
  • Haonan Lu (Oppo AI center)
  • He Tang (Huazhong University of Science and Technology)
  • Xiang Bai (Huazhong University of Science and Technology)

DOI:

https://doi.org/10.1609/aaai.v40i16.38355

Abstract

Video captioning aims to generate comprehensive and coherent descriptions of video content, contributing to the advancement of both video understanding and generation. However, existing methods often suffer from motion-detail imbalance: models tend to overemphasize one aspect while neglecting the other. This imbalance yields incomplete captions, which in turn leads to inconsistency in video understanding and generation. To address this issue, we propose solutions from two aspects: 1) Data aspect: We construct the Harmonizing Motion-Detail 270K (HMD-270K) dataset through a two-stage pipeline: Motion-Detail Fusion (MDF) and Fine-Grained Examination (FGE). 2) Optimization aspect: We introduce the Caption Set Equivalence Reward (CSER) based on Group Relative Policy Optimization (GRPO). CSER enhances completeness and accuracy in capturing both motion and details through unit-to-set matching and bidirectional validation. Through supervised fine-tuning on HMD-270K and GRPO post-training with CSER, we develop OwlCap, a powerful video captioning Multi-modal Large Language Model (MLLM) with motion-detail balance. Experimental results demonstrate that OwlCap achieves significant improvements over baseline models on two benchmarks: the detail-focused VDC (+4.2 Acc) and the motion-focused DREAM-1K (+4.6 F1).
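The abstract does not spell out how CSER is computed, so the following Python sketch is only an illustration of the general idea of unit-to-set matching with bidirectional validation, followed by the standard group-relative advantage normalization used in GRPO. Everything specific here is an assumption made for illustration: the function names (`covered`, `set_equivalence_reward`, `group_relative_advantages`), the token-overlap matcher, the 0.5 threshold, and the equal weighting of the two directions are hypothetical and are not taken from the paper, which may use a very different matcher (for example a model-based judge).

```python
from statistics import mean, pstdev


def covered(unit: str, units: list[str], threshold: float = 0.5) -> bool:
    """Hypothetical unit-to-set matcher: a unit counts as covered when at least
    `threshold` of its tokens appear in some single unit of the other set."""
    tokens = set(unit.lower().split())
    if not tokens:
        return False
    for other in units:
        overlap = len(tokens & set(other.lower().split())) / len(tokens)
        if overlap >= threshold:
            return True
    return False


def set_equivalence_reward(pred_units: list[str], ref_units: list[str]) -> float:
    """Assumed bidirectional check: predicted units must be supported by the
    reference set (accuracy) and reference units must be covered by the
    predicted set (completeness). The two scores are averaged."""
    if not pred_units or not ref_units:
        return 0.0
    accuracy = sum(covered(u, ref_units) for u in pred_units) / len(pred_units)
    completeness = sum(covered(u, pred_units) for u in ref_units) / len(ref_units)
    return 0.5 * (accuracy + completeness)


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style normalization: subtract the group mean and divide by the
    group standard deviation; a zero-variance group gets zero advantages."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


# Toy reference: one motion unit and one detail unit.
reference = ["a man opens the door", "the wall is painted red"]
motion_only = ["a man opens the door"]

full_reward = set_equivalence_reward(reference, reference)
partial_reward = set_equivalence_reward(motion_only, reference)
advantages = group_relative_advantages([full_reward, partial_reward])
```

In this toy setup a caption that covers only the motion unit scores lower than one that covers both units, so after group normalization it receives a negative advantage. That is the qualitative behavior a motion-detail balance reward would be expected to encourage.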

Published

2026-03-14

How to Cite

Zhong, C., Hou, Q., Zhou, Z., Zhang, Y., Hao, S., Lu, H., … Bai, X. (2026). OwlCap: Harmonizing Motion-Detail for Video Captioning via HMD-270K and Caption Set Equivalence Reward. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13503–13511. https://doi.org/10.1609/aaai.v40i16.38355

Section

AAAI Technical Track on Computer Vision XIII