MAMS: Model-Agnostic Module Selection Framework for Video Captioning

Authors

  • Sangho Lee (Sungkyunkwan University, Suwon 16419, Republic of Korea; Hippo T&C Company, Limited, Suwon 16419, Republic of Korea)
  • Il Yong Chun (Sungkyunkwan University, Suwon 16419, Republic of Korea; Center for Neuroscience Imaging Research, Institute for Basic Science (IBS), Suwon 16419, Republic of Korea)
  • Hogun Park (Sungkyunkwan University, Suwon 16419, Republic of Korea)

DOI:

https://doi.org/10.1609/aaai.v39i5.32478

Abstract

Multi-modal transformers are rapidly gaining attention in video captioning. Existing multi-modal video captioning methods extract a fixed number of frames, which poses critical challenges. If too few frames are extracted, frames carrying information essential for caption generation may be missed; conversely, if too many frames are extracted, consecutive frames are included, causing redundancy among the visual tokens extracted from them. To extract an appropriate number of frames for each video, this paper proposes the first model-agnostic module selection framework for video captioning, which has two main functions: (1) selecting a caption generation module of an appropriate size based on the visual tokens extracted from the video frames, and (2) constructing subsets of visual tokens for the selected caption generation module. Furthermore, we propose a new adaptive attention masking scheme that enhances attention on important visual tokens. Numerical experiments on three benchmark datasets demonstrate that the proposed framework significantly improves the performance of three recent video captioning models.
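The abstract does not specify how the adaptive attention mask is computed; purely as an illustration of the general idea of enhancing attention on important visual tokens, the following minimal NumPy sketch adds an importance-weighted bias to scaled dot-product attention scores before the softmax. The names `importance` and `boost` are assumptions for this sketch, not the paper's notation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def boosted_attention(q, k, v, importance, boost=2.0):
    """Scaled dot-product attention with an additive bias that
    shifts attention toward tokens flagged as important.

    q: (n_q, d) queries; k, v: (n_k, d) keys/values;
    importance: (n_k,) scores in [0, 1] (assumed given, e.g. by a
    saliency estimate); boost: scalar bias strength (assumed).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)            # (n_q, n_k) similarity scores
    scores = scores + boost * importance     # bias toward important tokens
    weights = softmax(scores, axis=-1)       # rows sum to 1
    return weights @ v, weights
```

A hard mask (setting scores of unimportant tokens to a large negative value) is the limiting case of this additive bias; the soft version above merely re-weights attention rather than removing tokens.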

Published

2025-04-11

How to Cite

Lee, S., Chun, I. Y., & Park, H. (2025). MAMS: Model-Agnostic Module Selection Framework for Video Captioning. Proceedings of the AAAI Conference on Artificial Intelligence, 39(5), 4535-4543. https://doi.org/10.1609/aaai.v39i5.32478

Section

AAAI Technical Track on Computer Vision IV