Diverse Beam Search for Improved Description of Complex Scenes

Ashwin Vijayakumar; Michael Cogswell; Ramprasaath Selvaraju; Qing Sun; Stefan Lee; David Crandall; Dhruv Batra

doi:10.1609/aaai.v32i1.12340

Authors

Ashwin Vijayakumar, Georgia Tech
Michael Cogswell, Georgia Tech
Ramprasaath Selvaraju, Georgia Tech
Qing Sun, Virginia Tech
Stefan Lee, Georgia Tech
David Crandall, Indiana University
Dhruv Batra, Georgia Tech; Facebook AI Research

DOI:

https://doi.org/10.1609/aaai.v32i1.12340

Keywords:

Recurrent Neural Networks, Beam Search, Diversity

Abstract

A single image captures the appearance and position of multiple entities in a scene as well as their complex interactions. As a consequence, natural language grounded in visual contexts tends to be diverse, with utterances differing as focus shifts to specific objects, interactions, or levels of detail. Recently, neural sequence models such as RNNs and LSTMs have been employed to produce visually grounded language. Beam Search (BS), the standard workhorse for decoding sequences from these models, is an approximate inference algorithm that decodes the top-B sequences in a greedy left-to-right fashion. In practice, the resulting sequences are often minor rewordings of a common utterance, failing to capture the multimodal nature of source images. To address this shortcoming, we propose Diverse Beam Search (DBS), a diversity-promoting alternative to BS for approximate inference. DBS produces sequences that are significantly different from each other by incorporating diversity constraints within groups of candidate sequences during decoding; moreover, it achieves this with minimal computational or memory overhead. We demonstrate that our method improves both diversity and quality of decoded sequences over existing techniques on two visually grounded language generation tasks, image captioning and visual question generation, particularly on complex scenes containing diverse visual content. We also show similar improvements on language-only machine translation tasks, highlighting the generality of our approach.
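
The contrast the abstract draws between ordinary beam search and group-wise diverse decoding can be illustrated with a small sketch. This is not the authors' implementation: the abstract says only that diversity constraints are applied within groups of candidate sequences, so the particular penalty used here (subtracting `strength` times the number of earlier groups' beams that chose the same token at the current step) is an assumption, as are the names `toy_log_probs`, `expand`, `beam_search`, `diverse_beam_search`, and `strength`. The toy scoring function stands in for a trained RNN or LSTM decoder.

```python
import math
from collections import Counter


def toy_log_probs(prefix):
    """Hypothetical stand-in for a neural decoder: log-probs over a 4-token vocab."""
    probs = [0.4, 0.3, 0.2, 0.1]
    shift = len(prefix) % 4  # make the distribution depend on the prefix length
    probs = probs[shift:] + probs[:shift]
    return [math.log(p) for p in probs]


def expand(beams, step_fn, width, penalty=None):
    """Extend every (tokens, score) beam by one token and keep the best `width`."""
    candidates = []
    for tokens, score in beams:
        for tok, lp in enumerate(step_fn(tokens)):
            # The penalty only affects ranking; the stored score stays the model log-prob.
            rank = score + lp - (penalty.get(tok, 0.0) if penalty else 0.0)
            candidates.append((rank, tokens + (tok,), score + lp))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [(tokens, score) for _, tokens, score in candidates[:width]]


def beam_search(step_fn, width, length):
    """Plain left-to-right beam search keeping the top `width` partial sequences."""
    beams = [((), 0.0)]
    for _ in range(length):
        beams = expand(beams, step_fn, width)
    return beams


def diverse_beam_search(step_fn, width, groups, strength, length):
    """Group-wise beam search; later groups are penalised for reusing tokens
    that earlier groups selected at the same time step (assumed penalty form)."""
    if width % groups != 0:
        raise ValueError("width must be divisible by groups")
    per_group = width // groups
    group_beams = [[((), 0.0)] for _ in range(groups)]
    for _ in range(length):
        used = Counter()  # tokens picked at this step by earlier groups
        for g in range(groups):
            penalty = {tok: strength * n for tok, n in used.items()}
            group_beams[g] = expand(group_beams[g], step_fn, per_group, penalty)
            used.update(tokens[-1] for tokens, _ in group_beams[g])
    return [beam for beams in group_beams for beam in beams]


if __name__ == "__main__":
    for tokens, score in diverse_beam_search(toy_log_probs, 4, 2, 5.0, 3):
        print(tokens, round(score, 3))
```

In this sketch, one group with zero penalty reduces to plain beam search, and each group runs an ordinary beam step over its own share of the beam budget, so the total number of expansions per step matches a single beam search of the same width.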
