DISCODE: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning

Authors

  • Nakamasa Inoue, Institute of Science Tokyo
  • Kanoko Goto, Institute of Science Tokyo
  • Masanari Oi, Institute of Science Tokyo
  • Martyna Gruszka, Institute of Science Tokyo
  • Mahiro Ukai, Institute of Science Tokyo
  • Takumi Hirose, Institute of Science Tokyo
  • Yusuke Sekikawa, DENSO IT Laboratory

DOI:

https://doi.org/10.1609/aaai.v40i7.37440

Abstract

Large vision-language models (LVLMs) have shown impressive performance across a broad range of multimodal tasks. However, robust image caption evaluation using LVLMs remains challenging, particularly under domain-shift scenarios. To address this issue, we introduce the Distribution-Aware Score Decoder (DISCODE), a novel finetuning-free method that generates robust evaluation scores better aligned with human judgments across diverse domains. The core idea behind DISCODE lies in its test-time adaptive evaluation approach, which introduces the Adaptive Test-Time (ATT) loss, leveraging a Gaussian prior distribution to improve robustness in evaluation score estimation. This loss is efficiently minimized at test time using an analytical solution that we derive. Furthermore, we introduce the Multi-domain Caption Evaluation (MCEval) benchmark, a new image captioning evaluation benchmark covering six distinct domains, designed to assess the robustness of evaluation metrics. In our experiments, we demonstrate that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric across MCEval and four representative existing benchmarks.
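
As a rough illustration of what "aligned with human judgments" can mean in practice, the sketch below computes a rank correlation (Kendall's tau-b) between a metric's scores and human ratings for the same captions. The abstract does not say which correlation measure the paper reports, so the choice of tau-b is an assumption, and the function name `kendall_tau_b` and the toy inputs are hypothetical. This is not the DISCODE method or the ATT loss, only a common way to quantify agreement between an automatic metric and human raters.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(metric_scores, human_scores):
    """Kendall's tau-b between two equal-length score lists.

    Hypothetical helper for illustration; the measure used in the
    paper is not stated in the abstract.
    """
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")

    concordant = discordant = 0
    tied_metric_only = tied_human_only = 0

    # Compare every pair of captions once.
    for (m1, h1), (m2, h2) in combinations(zip(metric_scores, human_scores), 2):
        dm, dh = m1 - m2, h1 - h2
        if dm == 0 and dh == 0:
            continue  # tied in both rankings: contributes to neither term
        if dm == 0:
            tied_metric_only += 1
        elif dh == 0:
            tied_human_only += 1
        elif dm * dh > 0:
            concordant += 1  # both rankings order the pair the same way
        else:
            discordant += 1  # the rankings disagree on this pair

    # tau-b normalizes by the pairs that are untied in each ranking.
    untied_in_metric = concordant + discordant + tied_human_only
    untied_in_human = concordant + discordant + tied_metric_only
    denom = sqrt(untied_in_metric * untied_in_human)
    return (concordant - discordant) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy data: metric scores and human ratings for four captions.
    metric = [0.10, 0.40, 0.35, 0.80]
    human = [1, 3, 2, 4]
    print(kendall_tau_b(metric, human))
```

A value near 1 means the metric ranks captions in nearly the same order as the human raters, a value near -1 means the order is reversed, and a value near 0 means little agreement.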

Published

2026-03-14

How to Cite

Inoue, N., Goto, K., Oi, M., Gruszka, M., Ukai, M., Hirose, T., & Sekikawa, Y. (2026). DISCODE: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(7), 5248–5256. https://doi.org/10.1609/aaai.v40i7.37440

Section

AAAI Technical Track on Computer Vision IV