Augmented Partial Mutual Learning with Frame Masking for Video Captioning

Authors

  • Ke Lin — Peking University; Samsung Research China-Beijing (SRC-B)
  • Zhuoxin Gan — Samsung Research China-Beijing (SRC-B)
  • Liwei Wang — Peking University

Keywords

Language and Vision, Multi-modal Vision, Video Understanding & Activity Analysis

Abstract

Recent video captioning work has improved greatly thanks to the invention of various elaborate model architectures. If multiple captioning models are combined into a unified framework, not merely by simple ensembling, so that each model can benefit from the others, captioning performance might be boosted further. However, joint training of multiple models has not been explored in previous work. In this paper, we propose a novel Augmented Partial Mutual Learning (APML) training method in which multiple decoders are trained jointly with mimicry losses computed between different decoders and different input variations. Another problem in training captioning models is the "one-to-many" mapping problem: one identical video input is mapped to multiple caption annotations. To address this problem, we propose an annotation-wise frame masking approach that converts the "one-to-many" mapping into a "one-to-one" mapping. Experiments on the MSR-VTT and MSVD datasets demonstrate that our proposed algorithm achieves state-of-the-art performance.
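The two ideas in the abstract can be illustrated with a minimal NumPy sketch: a symmetric-KL mimicry loss that pulls two decoders' token distributions toward each other, and a deterministic annotation-wise frame mask that gives each (video, caption) pair its own frame subset. This is an assumption-laden toy, not the paper's actual implementation; all names (`mimicry_loss`, `mask_frames`, `drop_ratio`) are illustrative.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for probability distributions along the last axis."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

def mimicry_loss(probs_a, probs_b):
    """Symmetric KL between two decoders' per-token distributions,
    averaged over time steps, so each decoder mimics the other."""
    return float(np.mean(kl_divergence(probs_a, probs_b)
                         + kl_divergence(probs_b, probs_a)) / 2)

def mask_frames(frames, annotation_id, drop_ratio=0.2, seed=0):
    """Annotation-wise frame masking (sketch): derive a deterministic
    mask from the caption index so each (video, caption) pair sees a
    distinct frame subset, turning one-to-many into one-to-one."""
    rng = np.random.default_rng(seed + annotation_id)
    keep = rng.random(frames.shape[0]) >= drop_ratio
    masked = frames.copy()
    masked[~keep] = 0.0  # zero out dropped frames
    return masked

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# toy example: two decoders, 4 time steps, vocabulary of 5 tokens
rng = np.random.default_rng(0)
probs_a = softmax(rng.normal(size=(4, 5)))
probs_b = softmax(rng.normal(size=(4, 5)))
loss = mimicry_loss(probs_a, probs_b)

frames = rng.normal(size=(8, 16))  # 8 frames, 16-dim features
m0 = mask_frames(frames, annotation_id=0)
m1 = mask_frames(frames, annotation_id=1)
```

In a full training loop this mimicry term would be added to each decoder's usual cross-entropy loss, and the mask would be applied to the frame features before encoding.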

Published

2021-05-18

How to Cite

Lin, K., Gan, Z., & Wang, L. (2021). Augmented Partial Mutual Learning with Frame Masking for Video Captioning. Proceedings of the AAAI Conference on Artificial Intelligence, 35(3), 2047-2055. Retrieved from https://ojs.aaai.org/index.php/AAAI/article/view/16301

Section

AAAI Technical Track on Computer Vision II