Set Prediction Guided by Semantic Concepts for Diverse Video Captioning

Authors

  • Yifan Lu, State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; School of Artificial Intelligence, University of Chinese Academy of Sciences
  • Ziqi Zhang, State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA
  • Chunfeng Yuan, State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA
  • Peng Li, Alibaba Group; Zhejiang Linkheer Science and Technology Co., Ltd.
  • Yan Wang, Alibaba Group; Zhejiang Linkheer Science and Technology Co., Ltd.
  • Bing Li, State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA
  • Weiming Hu, State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; School of Artificial Intelligence, University of Chinese Academy of Sciences; School of Information Science and Technology, ShanghaiTech University

DOI:

https://doi.org/10.1609/aaai.v38i4.28183

Keywords:

CV: Language and Vision

Abstract

Diverse video captioning aims to generate a set of sentences that describe a given video from various aspects. Mainstream methods are trained on independent pairs of a video and a single caption from its ground-truth set, without exploiting the intra-set relationship, which results in low diversity of the generated captions. In contrast, we formulate diverse captioning as a semantic-concept-guided set prediction (SCG-SP) problem that fits the predicted caption set to the ground-truth set, so that the set-level relationship is fully captured. Specifically, our set prediction consists of two synergistic tasks: caption generation and an auxiliary task of concept combination prediction that provides extra semantic supervision. Each caption in the set is attached to a concept combination indicating the primary semantic content of the caption and facilitating element alignment in set prediction. Furthermore, we apply a diversity regularization term on the concepts to encourage the model to generate semantically diverse captions with various concept combinations. The two tasks share multiple semantics-specific encodings as input, obtained by iterative interaction between visual features and conceptual queries. The correspondence between the generated captions and specific concept combinations further guarantees the interpretability of our model. Extensive experiments on benchmark datasets show that the proposed SCG-SP achieves state-of-the-art (SOTA) performance under both relevance and diversity metrics.
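To make the set-prediction idea concrete, below is a minimal PyTorch-style sketch, not the authors' released code, of the concept-guided matching the abstract describes: each of K generated captions carries a predicted concept combination, which is aligned one-to-one to the ground-truth combinations via the Hungarian algorithm, and a diversity term penalizes near-duplicate predictions. All function names, shapes, and loss forms here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def concept_set_losses(pred_logits, gt_concepts):
    """Hypothetical helper: match K predicted concept combinations to M
    ground-truth multi-hot concept vectors, then compute a concept BCE loss
    on the matched pairs and a pairwise diversity penalty.

    pred_logits: (K, C) concept logits, one row per generated caption.
    gt_concepts: (M, C) ground-truth multi-hot concept combinations (float).
    """
    probs = pred_logits.sigmoid().clamp(1e-6, 1 - 1e-6)  # (K, C)

    # Pairwise BCE cost between every prediction and every ground truth: (K, M).
    cost = -(probs.log() @ gt_concepts.t()
             + (1 - probs).log() @ (1 - gt_concepts).t())

    # Optimal one-to-one assignment, computed outside the autograd graph.
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())

    # Extra semantic supervision on the matched pairs.
    concept_loss = F.binary_cross_entropy(probs[row], gt_concepts[col])

    # Diversity regularizer: mean off-diagonal cosine similarity among the
    # predicted concept distributions; lower means more varied combinations.
    z = F.normalize(probs, dim=1)
    sim = z @ z.t()
    k = probs.size(0)
    diversity_loss = (sim.sum() - sim.diagonal().sum()) / (k * (k - 1))

    return concept_loss, diversity_loss, (row, col)

# Toy usage: 4 captions over an assumed 300-concept vocabulary.
pred = torch.randn(4, 300, requires_grad=True)
gt = (torch.rand(4, 300) < 0.05).float()
c_loss, d_loss, match = concept_set_losses(pred, gt)
(c_loss + 0.1 * d_loss).backward()
```

In the full model the matched indices would also route each ground-truth caption to its aligned decoder output for the caption generation loss; the 0.1 weight on the diversity term is an arbitrary placeholder.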

Published

2024-03-24

How to Cite

Lu, Y., Zhang, Z., Yuan, C., Li, P., Wang, Y., Li, B., & Hu, W. (2024). Set Prediction Guided by Semantic Concepts for Diverse Video Captioning. Proceedings of the AAAI Conference on Artificial Intelligence, 38(4), 3909-3917. https://doi.org/10.1609/aaai.v38i4.28183

Issue

Vol. 38 No. 4 (2024)
Section

AAAI Technical Track on Computer Vision III