EvalMuse-40K: A Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Alignment Evaluation

Shuhao Han; Haotian Fan; Jiachen Fu; Liang Li; Tao Li; Junhui Cui; Yunqiu Wang; Yang Tai; Jingwei Sun; Chun-Le Guo; Chongyi Li

doi:10.1609/aaai.v40i6.42458

Authors

Shuhao Han, Nankai University; ByteDance Inc.
Haotian Fan, ByteDance Inc.
Jiachen Fu, Nankai University
Liang Li, ByteDance Inc.
Tao Li, ByteDance Inc.
Junhui Cui, ByteDance Inc.
Yunqiu Wang, ByteDance Inc.
Yang Tai, ByteDance Inc.
Jingwei Sun, ByteDance Inc.
Chun-Le Guo, Nankai University; NKIARI, Shenzhen Futian
Chongyi Li, Nankai University; NKIARI, Shenzhen Futian

Abstract

Text-to-Image (T2I) generation models have advanced significantly, and many automated methods have emerged to evaluate their image-text alignment capabilities. However, comparing the performance of these automated methods is constrained by the limited scale of existing datasets, which also lack the capacity to assess automated methods at a fine-grained level. In this study, we contribute EvalMuse-40K, a dataset of 40K image-text pairs with fine-grained human annotations for image-text alignment-related tasks. During construction, we employ strategies such as balanced prompt sampling and data re-annotation to ensure the diversity and reliability of the dataset. This allows us to comprehensively evaluate the performance of image-text alignment methods for T2I models. Based on this dataset, we introduce an efficient automated evaluation method termed FGA-BLIP2. Built on BLIP2, it performs Fine-Grained Alignment evaluation using only the image and text as input, without running visual question answering for each fine-grained element. Experimental results show that FGA-BLIP2 efficiently achieves good performance on multiple image-text alignment datasets. Benefiting from its high efficiency and fine-grained evaluation capability, we also apply FGA-BLIP2 as a reward model to improve text-to-image models, which effectively enhances their image-text alignment ability.
