ReFINE: A Reward-Based Framework for Interpretable and Nuanced Evaluation of Radiology Report Generation

Authors

  • Yunyi Liu University of Sydney
  • Yingshu Li University of Sydney
  • Zhanyu Wang University of Sydney
  • Xinyu Liang Binzhou Medical University
  • Lingqiao Liu University of Adelaide
  • Lei Wang University of Wollongong
  • Luping Zhou University of Sydney

DOI:

https://doi.org/10.1609/aaai.v40i9.37680

Abstract

Automated radiology report generation (R2Gen) has advanced significantly, yet evaluation remains challenging due to the complexity of assessing report quality. Traditional metrics often misalign with human judgments, failing to identify specific deficiencies. To address this, we introduce ReFINE, a framework for training an Evaluation Model using a novel margin-based reward enforcement loss. This approach decomposes report quality into fine-grained sub-scores across user-defined criteria, improving interpretability. Leveraging GPT-4, we generate diverse training data with paired accepted and rejected reports to train our model under a reward-based system. The trained ReFINE Score provides both granular sub-scores and an aggregated quality assessment, enabling criterion-specific evaluation. Experimental results demonstrate ReFINE's superior alignment with human judgments, outperforming traditional metrics in model selection. Its robustness is validated across three expert-annotated datasets—including chest X-rays and multimodal reports covering 9 imaging modalities—and under two distinct scoring systems.

Published

2026-03-14

How to Cite

Liu, Y., Li, Y., Wang, Z., Liang, X., Liu, L., Wang, L., & Zhou, L. (2026). ReFINE: A Reward-Based Framework for Interpretable and Nuanced Evaluation of Radiology Report Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(9), 7413–7421. https://doi.org/10.1609/aaai.v40i9.37680

Section

AAAI Technical Track on Computer Vision VI