ChatterBox: Multimodal Referring and Grounding with Chain-of-Questions

Yunjie Tian; Tianren Ma; Lingxi Xie; Qixiang Ye

doi:10.1609/aaai.v39i7.32796

Authors

Yunjie Tian University of Chinese Academy of Sciences
Tianren Ma University of Chinese Academy of Sciences
Lingxi Xie Huawei Technologies Ltd.
Qixiang Ye University of Chinese Academy of Sciences

DOI:

https://doi.org/10.1609/aaai.v39i7.32796

Abstract

In this study, we establish a benchmark and a baseline approach for Multimodal referring and grounding with Chain-of-Questions (MCQ), opening up a promising direction for ‘logical’ multimodal dialogues. The newly collected dataset, named CB-300K, covers challenges including probing dialogues about spatial relationships among multiple objects, consistent reasoning, and complex question chains. The baseline approach, termed ChatterBox, uses a modularized design and a referent feedback mechanism to ensure logical coherence in continuous referring and grounding tasks. This design reduces the risk of referential confusion, simplifies the training process, and proves effective at retaining the language model’s generation ability. Experiments show that ChatterBox demonstrates superiority in MCQ both quantitatively and qualitatively, paving a new path towards multimodal dialogue scenarios with logical interactions.
