Visual Question Answer Diversity
Visual questions (VQs) can lead multiple people to respond with different answers rather than a single, agreed-upon response. Moreover, the answers from a crowd can include different numbers of unique answers that arise with different relative frequencies. Such answer diversity arises for a variety of reasons, including that VQs are subjective, difficult, or ambiguous. We propose a new problem of predicting the answer distribution that would be observed from a crowd for any given VQ; i.e., the number of unique answers and their relative frequencies. Our experiments confirm that the answer distribution can be predicted accurately for VQs asked by both blind and sighted people. We then propose a novel crowd-powered VQA system that uses the answer distribution predictions to reason about how many answers are needed to capture the diversity of possible human responses. Experiments demonstrate that the proposed system accelerates the capture of answer diversity, requiring considerably less human effort than a state-of-the-art system.
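To make the notion of an answer distribution concrete, the following minimal Python sketch computes the number of unique answers and their relative frequencies from a list of crowd responses to one VQ. The function name `answer_distribution`, the sample answers, and the normalization step (lowercasing and trimming whitespace) are illustrative assumptions, not details taken from this work.

```python
from collections import Counter


def answer_distribution(answers):
    """Return (number of unique answers, {answer: relative frequency}).

    Hypothetical helper for illustration only. Answers are compared after
    lowercasing and trimming whitespace, which is an assumed normalization.
    """
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    total = len(normalized)
    # most_common() orders answers from most to least frequent.
    freqs = {answer: n / total for answer, n in counts.most_common()}
    return len(counts), freqs


# Made-up crowd responses to a single visual question.
crowd_answers = ["white", "White", "cream", "white", "beige"]
num_unique, freqs = answer_distribution(crowd_answers)
print(num_unique, freqs)
```

Here five responses collapse to three unique answers, one of which accounts for most of the responses; this pair of quantities is what the prediction problem targets for a given VQ.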