Partially Non-Autoregressive Image Captioning

Zhengcong Fei

doi:10.1609/aaai.v35i2.16219

Authors

Zhengcong Fei
Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
University of Chinese Academy of Sciences, Beijing 100049, China

DOI:

https://doi.org/10.1609/aaai.v35i2.16219

Keywords:

Language and Vision

Abstract

Current state-of-the-art image captioning systems usually generate descriptions autoregressively, i.e., every forward step conditions on the given image and the previously produced words. This sequential property causes an unavoidable decoding latency. Non-autoregressive image captioning, on the other hand, predicts the entire sentence simultaneously and accelerates inference significantly. However, it discards the dependencies among the words of a caption and commonly suffers from repeated or missing words. To make a better trade-off between speed and quality, we introduce a partially non-autoregressive model, named PNAIC, which treats a caption as a series of concatenated word groups. The groups are generated in parallel, while the words within each group are predicted from left to right, so the captioner can produce multiple non-adjacent words concurrently at each time step. More importantly, by incorporating curriculum learning-based training tasks of group length prediction and invalid group deletion, our model is capable of generating accurate captions while preventing common incoherence errors. Extensive experiments on the MS COCO benchmark demonstrate that our proposed method achieves a speedup of more than 3.5× while maintaining competitive performance.
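The group-wise decoding idea described in the abstract can be illustrated with a minimal toy sketch. This is not the PNAIC implementation: the names `decode_in_groups`, `assemble`, `toy_next_word`, the `<eog>` end-of-group marker, and the lookup table standing in for a trained captioner are all hypothetical, and dropping empty groups is only a simplified stand-in for invalid group deletion. The sketch shows only the control flow: at each time step every unfinished group receives one new word, and within a group words are appended left to right.

```python
from typing import Callable, List, Tuple

EOG = "<eog>"  # hypothetical end-of-group marker (an assumption, not from the paper)


def decode_in_groups(
    num_groups: int,
    next_word: Callable[[int, List[str]], str],
    max_group_len: int = 8,
) -> Tuple[List[List[str]], int]:
    """Grow all word groups in lockstep, one word per group per time step."""
    groups: List[List[str]] = [[] for _ in range(num_groups)]
    finished = [False] * num_groups
    steps = 0
    while not all(finished) and steps < max_group_len:
        # One time step: each unfinished group predicts its next word from
        # its own prefix. A real model would do this in one batched pass.
        new_words = [
            None if finished[g] else next_word(g, groups[g])
            for g in range(num_groups)
        ]
        for g, word in enumerate(new_words):
            if word is None:
                continue
            if word == EOG:
                finished[g] = True
            else:
                groups[g].append(word)
        steps += 1
    return groups, steps


def assemble(groups: List[List[str]]) -> List[str]:
    """Concatenate groups in order; empty groups contribute nothing."""
    return [word for group in groups for word in group]


# Toy stand-in for a trained captioner: a fixed target per group.
TOY_TARGETS = [["a", "man"], ["riding", "a"], ["horse"]]


def toy_next_word(group_index: int, prefix: List[str]) -> str:
    target = TOY_TARGETS[group_index]
    return target[len(prefix)] if len(prefix) < len(target) else EOG


groups, steps = decode_in_groups(3, toy_next_word)
print(" ".join(assemble(groups)), "| parallel steps:", steps)
```

In this toy run the five-word caption is assembled in three parallel steps, whereas a purely left-to-right decoder would need one step per word. The speedup reported in the abstract comes from the authors' experiments, not from this sketch.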
