Multimodal Hypothetical Summary for Retrieval-based Multi-image Question Answering

Authors

  • Peize Li (School of Artificial Intelligence, Jilin University, Changchun, China; Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China)
  • Qingyi Si (Huawei Technologies Co., Ltd., Beijing, China)
  • Peng Fu (Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China)
  • Zheng Lin (Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China)
  • Yan Wang (Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China; School of Artificial Intelligence, Jilin University, Changchun, China)

DOI

https://doi.org/10.1609/aaai.v39i5.32513

Abstract

The retrieval-based multi-image question answering (QA) task involves retrieving multiple question-related images and synthesizing information from them to generate an answer. Conventional "retrieve-then-answer" pipelines often suffer from cascading errors because the QA training objective fails to optimize the retrieval stage. To address this issue, we propose a novel method that effectively introduces retrieved information into QA and allows the model to reference it. Given the image set to be retrieved, we employ a multimodal large language model (visual perspective) and a large language model (textual perspective) to obtain a multimodal hypothetical summary (MHyS) in both question form and description form. By combining visual and textual perspectives, MHyS captures image content more specifically and stands in for the real images during retrieval, which eliminates the modality gap by transforming the task into text-to-text retrieval and thereby improves retrieval. To better couple retrieval with QA, we employ contrastive learning to align queries (questions) with MHyS. Moreover, we propose a coarse-to-fine strategy that computes both sentence-level and word-level similarity scores, further enhancing retrieval and filtering out irrelevant details. Our approach achieves a 3.7% absolute improvement over state-of-the-art methods on RETVQA and a 14.5% improvement over CLIP. Comprehensive experiments and detailed ablation studies demonstrate the superiority of our method.
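
To make the retrieval side of the abstract concrete, below is a minimal sketch, not the authors' released implementation, of how text-to-text retrieval over hypothetical summaries with contrastive alignment and coarse-to-fine scoring could be wired up. The encoder outputs are stubbed with random tensors, and the late-interaction word matching, the mixing weight `alpha`, the `top_k` cutoff, and the InfoNCE temperature are all illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def sentence_level_scores(query_emb, summary_embs):
    """Coarse score: cosine similarity between one query embedding [d]
    and N candidate MHyS embeddings [N, d]. Returns [N]."""
    q = F.normalize(query_emb, dim=-1)
    s = F.normalize(summary_embs, dim=-1)
    return s @ q

def word_level_scores(query_tokens, summary_tokens):
    """Fine score (late-interaction style, an assumed formulation):
    query_tokens [Tq, d], summary_tokens [N, Ts, d]. Each query token
    takes its best-matching summary token; scores are averaged over
    query tokens. Returns [N]."""
    q = F.normalize(query_tokens, dim=-1)           # [Tq, d]
    s = F.normalize(summary_tokens, dim=-1)         # [N, Ts, d]
    sim = torch.einsum("qd,ntd->nqt", q, s)         # [N, Tq, Ts]
    return sim.max(dim=-1).values.mean(dim=-1)      # [N]

def coarse_to_fine(query_emb, summary_embs, query_tokens, summary_tokens,
                   alpha=0.5, top_k=20):
    """Rank all candidates by the coarse sentence-level score, then
    re-score the top_k with a weighted sum of coarse and fine scores."""
    coarse = sentence_level_scores(query_emb, summary_embs)      # [N]
    top = coarse.topk(min(top_k, coarse.numel())).indices
    fine = word_level_scores(query_tokens, summary_tokens[top])  # [top_k]
    return top, alpha * coarse[top] + (1 - alpha) * fine

def info_nce_loss(query_embs, summary_embs, temperature=0.07):
    """Contrastive alignment of queries with their matching MHyS using
    in-batch negatives; matching pairs share a batch index."""
    q = F.normalize(query_embs, dim=-1)
    s = F.normalize(summary_embs, dim=-1)
    logits = (q @ s.t()) / temperature
    labels = torch.arange(q.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Toy tensors standing in for encoder outputs (hypothetical dimensions).
d, N, Tq, Ts = 256, 100, 12, 64
query_emb = torch.randn(d)
summary_embs = torch.randn(N, d)
query_tokens = torch.randn(Tq, d)
summary_tokens = torch.randn(N, Ts, d)
idx, scores = coarse_to_fine(query_emb, summary_embs, query_tokens, summary_tokens)
```

Because the candidates here are MHyS texts rather than images, both scoring stages operate entirely in the text embedding space, which is how the modality gap is avoided in the sketch above.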

Published

2025-04-11

How to Cite

Li, P., Si, Q., Fu, P., Lin, Z., & Wang, Y. (2025). Multimodal Hypothetical Summary for Retrieval-based Multi-image Question Answering. Proceedings of the AAAI Conference on Artificial Intelligence, 39(5), 4851-4859. https://doi.org/10.1609/aaai.v39i5.32513

Issue

Vol. 39 No. 5 (2025)

Section

AAAI Technical Track on Computer Vision IV