Uncertainty-Aware Image Captioning

Authors

  • Zhengcong Fei, Meituan
  • Mingyuan Fan, Meituan
  • Li Zhu, Meituan
  • Junshi Huang, Meituan
  • Xiaoming Wei, Meituan
  • Xiaolin Wei, Meituan

DOI:

https://doi.org/10.1609/aaai.v37i1.25137

Keywords:

CV: Language and Vision, CV: Applications

Abstract

It is widely believed that the higher the uncertainty of a word in a caption, the more inter-correlated contextual information is required to determine it. However, current image captioning methods usually generate all words of a sentence sequentially and treat them equally. In this paper, we propose an uncertainty-aware image captioning framework that iteratively and in parallel inserts discontinuous candidate words between existing words, from easy to difficult, until convergence. We hypothesize that high-uncertainty words in a sentence need more prior information to be decided correctly and should therefore be produced at a later stage. The resulting non-autoregressive hierarchy makes caption generation explainable and intuitive. Specifically, we utilize an image-conditioned bag-of-words model to measure word uncertainty and apply a dynamic programming algorithm to construct the training pairs. During inference, we devise an uncertainty-adaptive parallel beam search technique that yields an empirically logarithmic time complexity. Extensive experiments on the MS COCO benchmark show that our approach outperforms the strong baseline and related methods in both captioning quality and decoding speed.
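To make the easy-to-hard generation order concrete, the following is a minimal sketch (not the paper's implementation) of how words of a caption might be committed round by round according to per-word uncertainty: low-uncertainty words are fixed first, and higher-uncertainty words are inserted between them in later rounds. The uncertainty scores below are illustrative stand-ins for the paper's image-conditioned bag-of-words estimates, and the round grouping is a simple quantile split rather than the paper's dynamic programming construction.

```python
def insertion_schedule(words, uncertainty, num_rounds=3):
    """Group caption words into generation rounds by uncertainty.

    Returns the partial caption visible after each round, with the
    original word order preserved (low-uncertainty words appear first,
    higher-uncertainty words are inserted in later rounds).
    """
    # Indices sorted from easiest (lowest uncertainty) to hardest.
    order = sorted(range(len(words)), key=lambda i: uncertainty[i])
    per_round = -(-len(words) // num_rounds)  # ceiling division
    committed = set()
    rounds = []
    for r in range(num_rounds):
        for i in order[r * per_round:(r + 1) * per_round]:
            committed.add(i)
        # Partial caption after this round, in original word order.
        rounds.append([words[i] for i in sorted(committed)])
    return rounds


# Illustrative example: scores are hypothetical, not model outputs.
words = ["a", "scruffy", "dog", "chases", "a", "red", "frisbee"]
unc = [0.1, 0.9, 0.2, 0.6, 0.1, 0.8, 0.3]
stages = insertion_schedule(words, unc)
```

Here the first round commits only the easy words ("a", "dog", "a"), and the hardest word ("scruffy") is inserted last, mirroring the paper's claim that high-uncertainty words benefit from more surrounding context before being decided.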

Published

2023-06-26

How to Cite

Fei, Z., Fan, M., Zhu, L., Huang, J., Wei, X., & Wei, X. (2023). Uncertainty-Aware Image Captioning. Proceedings of the AAAI Conference on Artificial Intelligence, 37(1), 614-622. https://doi.org/10.1609/aaai.v37i1.25137

Section

AAAI Technical Track on Computer Vision I