EvalMuse-40K: A Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Alignment Evaluation
DOI:
https://doi.org/10.1609/aaai.v40i6.42458
Abstract
Text-to-Image (T2I) generation models have advanced significantly, and many automated methods have emerged to evaluate their image-text alignment capabilities. However, comparing these automated methods is constrained by the limited scale of existing datasets, which also lack the capacity to assess such methods at a fine-grained level. In this study, we contribute EvalMuse-40K, a dataset of 40K image-text pairs with fine-grained human annotations for image-text alignment-related tasks. During construction, we employ strategies such as balanced prompt sampling and data re-annotation to ensure the diversity and reliability of the dataset. This allows us to comprehensively evaluate the performance of image-text alignment methods for T2I models. Based on this dataset, we introduce an efficient automated evaluation method termed FGA-BLIP2, which leverages BLIP2 to perform Fine-Grained Alignment evaluation from the image and text inputs alone, without visual question answering for each fine-grained element. Experimental results show that FGA-BLIP2 efficiently achieves good performance on multiple image-text alignment datasets. Moreover, given its high efficiency and fine-grained evaluation capability, we apply FGA-BLIP2 as a reward model to improve T2I models, effectively enhancing their image-text alignment ability.
Published
2026-03-14
How to Cite
Han, S., Fan, H., Fu, J., Li, L., Li, T., Cui, J., … Li, C. (2026). EvalMuse-40K: A Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Alignment Evaluation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(6), 4583–4591. https://doi.org/10.1609/aaai.v40i6.42458
Section
AAAI Technical Track on Computer Vision III