Show, Recall, and Tell: Image Captioning with Recall Mechanism


  • Li Wang, Shanghai Jiao Tong University
  • Zechen Bai, Chinese Academy of Sciences
  • Yonghua Zhang, Bytedance
  • Hongtao Lu, Shanghai Jiao Tong University



Generating natural and accurate descriptions in image captioning has always been a challenge. In this paper, we propose a novel recall mechanism that imitates the way humans perform captioning. Our recall mechanism has three parts: a recall unit, a semantic guide (SG), and a recalled-word slot (RWS). The recall unit is a text-retrieval module designed to retrieve recalled words for images. SG and RWS are designed to make the best use of the recalled words. The SG branch generates a recalled context, which guides the caption generation process. The RWS branch is responsible for copying recalled words into the caption. Inspired by the pointing mechanism in text summarization, we adopt a soft switch to balance the generated-word probabilities between SG and RWS. In the CIDEr optimization step, we also introduce an individual recalled-word reward (WR) to boost training. Our proposed methods (SG+RWS+WR) achieve BLEU-4 / CIDEr / SPICE scores of 36.6 / 116.9 / 21.3 with cross-entropy loss and 38.7 / 129.1 / 22.4 with CIDEr optimization on the MSCOCO Karpathy test split, which surpass the results of other state-of-the-art methods.
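
The soft switch between the two branches can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the code below assumes the usual pointer-style mixture from text summarization: a scalar switch in [0, 1] weights the vocabulary distribution from the generation (SG) branch, and the remaining mass is spread over the recalled words according to a copy (RWS) attention distribution. The function names, the direction of the switch, and the toy numbers are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_with_soft_switch(vocab_probs, recalled_word_ids, copy_attention, switch):
    """Blend a generation distribution with a copy distribution.

    vocab_probs:       probabilities over the vocabulary (assumed SG branch output)
    recalled_word_ids: vocabulary index of each recalled word
    copy_attention:    probabilities over the recalled words (assumed RWS branch output)
    switch:            scalar in [0, 1]; weight given to the generation branch
    """
    if not 0.0 <= switch <= 1.0:
        raise ValueError("switch must lie in [0, 1]")
    if len(recalled_word_ids) != len(copy_attention):
        raise ValueError("one attention weight is needed per recalled word")

    # Scale the generation distribution by the switch.
    final = [switch * p for p in vocab_probs]
    # Add the remaining probability mass onto the recalled words.
    for word_id, weight in zip(recalled_word_ids, copy_attention):
        final[word_id] += (1.0 - switch) * weight
    return final


# Toy example: a 4-word vocabulary in which words 1 and 3 were recalled.
vocab_probs = softmax([1.0, 2.0, 0.5, 0.1])
copy_attention = softmax([0.3, 1.2])
final = mix_with_soft_switch(vocab_probs, [1, 3], copy_attention, switch=0.7)
```

Because both inputs are probability distributions and the two weights sum to one, the mixture is again a probability distribution. In a trained model the switch would typically be predicted at each decoding step rather than fixed, which is also an assumption here.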




How to Cite

Wang, L., Bai, Z., Zhang, Y., & Lu, H. (2020). Show, Recall, and Tell: Image Captioning with Recall Mechanism. Proceedings of the AAAI Conference on Artificial Intelligence, 34(07), 12176-12183.



AAAI Technical Track: Vision