EvalMuse-40K: A Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Alignment Evaluation

Authors

  • Shuhao Han (Nankai University; ByteDance Inc.)
  • Haotian Fan (ByteDance Inc.)
  • Jiachen Fu (Nankai University)
  • Liang Li (ByteDance Inc.)
  • Tao Li (ByteDance Inc.)
  • Junhui Cui (ByteDance Inc.)
  • Yunqiu Wang (ByteDance Inc.)
  • Yang Tai (ByteDance Inc.)
  • Jingwei Sun (ByteDance Inc.)
  • Chun-Le Guo (Nankai University; NKIARI, Shenzhen Futian)
  • Chongyi Li (Nankai University; NKIARI, Shenzhen Futian)

DOI:

https://doi.org/10.1609/aaai.v40i6.42458

Abstract

Text-to-Image (T2I) generation models have advanced significantly, and many automated methods have emerged to evaluate their image-text alignment capabilities. However, comparing these automated methods is constrained by the limited scale of existing datasets, which also lack the capacity to assess the methods at a fine-grained level. In this study, we contribute EvalMuse-40K, a dataset of 40K image-text pairs with fine-grained human annotations for image-text alignment-related tasks. During construction, we employ strategies such as balanced prompt sampling and data re-annotation to ensure the diversity and reliability of the dataset. This allows us to comprehensively evaluate the performance of image-text alignment methods for T2I models. Based on this dataset, we introduce an efficient automated evaluation method termed FGA-BLIP2, which leverages BLIP2 to perform Fine-Grained Alignment evaluation from the image and text alone, without visual question answering for each fine-grained element. Experimental results show that FGA-BLIP2 efficiently achieves good performance on multiple image-text alignment datasets. Because FGA-BLIP2 is efficient and capable of fine-grained evaluation, we also apply it as a reward model to improve T2I models, which effectively enhances their image-text alignment ability.

Published

2026-03-14

How to Cite

Han, S., Fan, H., Fu, J., Li, L., Li, T., Cui, J., … Li, C. (2026). EvalMuse-40K: A Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Alignment Evaluation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(6), 4583–4591. https://doi.org/10.1609/aaai.v40i6.42458

Section

AAAI Technical Track on Computer Vision III