Multimodal Hypothetical Summary for Retrieval-based Multi-image Question Answering

Authors

  • Peize Li (School of Artificial Intelligence, Jilin University, Changchun, China; Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China)
  • Qingyi Si (Huawei Technologies Co., Ltd., Beijing, China)
  • Peng Fu (Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China)
  • Zheng Lin (Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China)
  • Yan Wang (Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China; School of Artificial Intelligence, Jilin University, Changchun, China)

DOI

https://doi.org/10.1609/aaai.v39i5.32513

Abstract

The retrieval-based multi-image question answering (QA) task involves retrieving multiple question-related images and synthesizing information from them to generate an answer. Conventional "retrieve-then-answer" pipelines often suffer from cascading errors because the QA training objective fails to optimize the retrieval stage. To address this issue, we propose a novel method that effectively introduces retrieved information into QA and allows the model to reference it. Given the image set to be retrieved, we employ a multimodal large language model (visual perspective) and a large language model (textual perspective) to obtain a multimodal hypothetical summary (MHyS) in both question form and description form. By combining visual and textual perspectives, MHyS captures image content more specifically and stands in for the real images during retrieval, which eliminates the modality gap by transforming the task into text-to-text retrieval and thereby improves retrieval. To better couple retrieval with QA, we employ contrastive learning to align queries (questions) with MHyS. Moreover, we propose a coarse-to-fine strategy that computes both sentence-level and word-level similarity scores, further enhancing retrieval and filtering out irrelevant details. Our approach achieves a 3.7% absolute improvement over state-of-the-art methods on RETVQA and a 14.5% improvement over CLIP. Comprehensive experiments and detailed ablation studies demonstrate the superiority of our method.
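
To make the retrieval side of the abstract concrete, below is a minimal sketch, not the authors' released implementation, of how text-to-text retrieval over hypothetical summaries with contrastive alignment and coarse-to-fine scoring could be wired up. The encoder outputs are stubbed with random tensors, and the late-interaction word matching, the mixing weight `alpha`, the `top_k` cutoff, and the InfoNCE temperature are all illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def sentence_level_scores(query_emb, summary_embs):
    """Coarse score: cosine similarity between one query embedding [d]
    and N candidate MHyS embeddings [N, d]. Returns [N]."""
    q = F.normalize(query_emb, dim=-1)
    s = F.normalize(summary_embs, dim=-1)
    return s @ q

def word_level_scores(query_tokens, summary_tokens):
    """Fine score (late-interaction style, an assumed formulation):
    query_tokens [Tq, d], summary_tokens [N, Ts, d]. Each query token
    takes its best-matching summary token; scores are averaged over
    query tokens. Returns [N]."""
    q = F.normalize(query_tokens, dim=-1)           # [Tq, d]
    s = F.normalize(summary_tokens, dim=-1)         # [N, Ts, d]
    sim = torch.einsum("qd,ntd->nqt", q, s)         # [N, Tq, Ts]
    return sim.max(dim=-1).values.mean(dim=-1)      # [N]

def coarse_to_fine(query_emb, summary_embs, query_tokens, summary_tokens,
                   alpha=0.5, top_k=20):
    """Rank all candidates by the coarse sentence-level score, then
    re-score the top_k with a weighted sum of coarse and fine scores."""
    coarse = sentence_level_scores(query_emb, summary_embs)      # [N]
    top = coarse.topk(min(top_k, coarse.numel())).indices
    fine = word_level_scores(query_tokens, summary_tokens[top])  # [top_k]
    return top, alpha * coarse[top] + (1 - alpha) * fine

def info_nce_loss(query_embs, summary_embs, temperature=0.07):
    """Contrastive alignment of queries with their matching MHyS using
    in-batch negatives; matching pairs share a batch index."""
    q = F.normalize(query_embs, dim=-1)
    s = F.normalize(summary_embs, dim=-1)
    logits = (q @ s.t()) / temperature
    labels = torch.arange(q.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Toy tensors standing in for encoder outputs (hypothetical dimensions).
d, N, Tq, Ts = 256, 100, 12, 64
query_emb = torch.randn(d)
summary_embs = torch.randn(N, d)
query_tokens = torch.randn(Tq, d)
summary_tokens = torch.randn(N, Ts, d)
idx, scores = coarse_to_fine(query_emb, summary_embs, query_tokens, summary_tokens)
```

Because the candidates here are MHyS texts rather than images, both scoring stages operate entirely in the text embedding space, which is how the modality gap is avoided in the sketch above.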

Published

2025-04-11

How to Cite

Li, P., Si, Q., Fu, P., Lin, Z., & Wang, Y. (2025). Multimodal Hypothetical Summary for Retrieval-based Multi-image Question Answering. Proceedings of the AAAI Conference on Artificial Intelligence, 39(5), 4851-4859. https://doi.org/10.1609/aaai.v39i5.32513

Issue

Vol. 39 No. 5 (2025)

Section

AAAI Technical Track on Computer Vision IV