ChatterBox: Multimodal Referring and Grounding with Chain-of-Questions
DOI:
https://doi.org/10.1609/aaai.v39i7.32796
Abstract
In this study, we establish a benchmark and a baseline approach for Multimodal referring and grounding with Chain-of-Questions (MCQ), opening up a promising direction for ‘logical’ multimodal dialogues. The newly collected dataset, named CB-300K, spans challenges including probing dialogues with spatial relationships among multiple objects, consistent reasoning, and complex question chains. The baseline approach, termed ChatterBox, involves a modularized design and a referent feedback mechanism to ensure logical coherence in continuous referring and grounding tasks. This design reduces the risk of referential confusion, simplifies the training process, and is effective in retaining the language model’s generation ability. Experiments show that ChatterBox demonstrates superiority in MCQ both quantitatively and qualitatively, paving a new path towards multimodal dialogue scenarios with logical interactions.
Published
2025-04-11
How to Cite
Tian, Y., Ma, T., Xie, L., & Ye, Q. (2025). ChatterBox: Multimodal Referring and Grounding with Chain-of-Questions. Proceedings of the AAAI Conference on Artificial Intelligence, 39(7), 7401–7409. https://doi.org/10.1609/aaai.v39i7.32796
Issue
Section
AAAI Technical Track on Computer Vision VI