LLM-Free Image Captioning Evaluation in Reference-Flexible Settings

Authors

  • Shinnosuke Hirano, Keio University
  • Yuiga Wada, Keio University
  • Kazuki Matsuda, Keio University
  • Seitaro Otsuki, Keio University
  • Komei Sugiura, Keio University

DOI:

https://doi.org/10.1609/aaai.v40i6.42472

Abstract

We focus on the automatic evaluation of image captions in both reference-based and reference-free settings. Existing metrics based on large language models (LLMs) favor their own generations, so their neutrality is questionable. Most LLM-free metrics avoid this issue but do not always achieve high performance. To address both problems, we propose Pearl, an LLM-free supervised metric for image captioning that is applicable to both reference-based and reference-free settings. We introduce a novel mechanism that learns representations of image–caption and caption–caption similarities. Furthermore, we construct a human-annotated dataset for image captioning metrics comprising approximately 333k human judgments collected from 2,360 annotators across over 75k images. Pearl outperformed existing LLM-free metrics on the Composite, Flickr8K-Expert, Flickr8K-CF, Nebula, and FOIL datasets in both reference-based and reference-free settings.
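To make the two evaluation settings concrete: a reference-free metric scores a candidate caption against the image alone, while a reference-based metric can additionally compare the candidate with human reference captions. The sketch below illustrates this distinction with plain cosine similarities over toy embedding vectors; it is a minimal illustration only, not Pearl's actual architecture, and the `score` function, the equal-weight fusion, and the toy embeddings are all assumptions for demonstration.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def score(img_emb, cand_emb, ref_embs=None):
    """Score a candidate caption embedding against an image embedding,
    optionally using reference caption embeddings when available.
    NOTE: illustrative only; Pearl learns these similarity
    representations rather than using raw cosine scores."""
    img_sim = cosine(img_emb, cand_emb)  # reference-free signal
    if not ref_embs:
        return img_sim                   # reference-free setting
    # Caption-caption signal: best match over available references.
    ref_sim = max(cosine(r, cand_emb) for r in ref_embs)
    # Hypothetical equal-weight fusion of the two signals.
    return 0.5 * (img_sim + ref_sim)

# Toy 4-d embeddings standing in for encoder outputs.
img = np.array([1.0, 0.0, 0.0, 0.0])
cand = np.array([0.9, 0.1, 0.0, 0.0])
refs = [np.array([0.8, 0.2, 0.0, 0.0])]

print(score(img, cand))        # reference-free score
print(score(img, cand, refs))  # reference-based score
```

With no references supplied, the score reduces to the image-caption similarity, so a single function covers both settings, which is the "reference-flexible" property the title refers to.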

Published

2026-03-14

How to Cite

Hirano, S., Wada, Y., Matsuda, K., Otsuki, S., & Sugiura, K. (2026). LLM-Free Image Captioning Evaluation in Reference-Flexible Settings. Proceedings of the AAAI Conference on Artificial Intelligence, 40(6), 4708–4716. https://doi.org/10.1609/aaai.v40i6.42472

Section

AAAI Technical Track on Computer Vision III