ReFINE: A Reward-Based Framework for Interpretable and Nuanced Evaluation of Radiology Report Generation
DOI:
https://doi.org/10.1609/aaai.v40i9.37680
Abstract
Automated radiology report generation (R2Gen) has advanced significantly, yet evaluation remains challenging due to the complexity of assessing report quality. Traditional metrics often misalign with human judgments, failing to identify specific deficiencies. To address this, we introduce ReFINE, a framework for training an Evaluation Model using a novel margin-based reward enforcement loss. This approach decomposes report quality into fine-grained sub-scores across user-defined criteria, improving interpretability. Leveraging GPT-4, we generate diverse training data with paired accepted and rejected reports to train our model under a reward-based system. The trained ReFINE Score provides both granular sub-scores and an aggregated quality assessment, enabling criterion-specific evaluation. Experimental results demonstrate ReFINE's superior alignment with human judgments, outperforming traditional metrics in model selection. Its robustness is validated across three expert-annotated datasets (including chest X-rays and multimodal reports covering 9 imaging modalities) and under two distinct scoring systems.
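As a rough illustration of the idea of a margin-based reward over paired accepted and rejected reports, the sketch below uses a generic pairwise hinge loss applied per criterion. It is an assumption-laden example, not the loss defined in the paper: the function names `margin_reward_loss` and `aggregate`, the default margin of 1.0, the averaging over criteria, and the use of a plain sum for the aggregated score are all hypothetical choices made for illustration.

```python
from typing import Sequence


def margin_reward_loss(
    accepted: Sequence[float],
    rejected: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Generic pairwise hinge loss over per-criterion sub-scores.

    For each criterion, the loss is zero once the accepted report's
    sub-score exceeds the rejected report's sub-score by at least
    `margin`; otherwise it grows linearly with the shortfall.
    (Hypothetical formulation, not taken from the paper.)
    """
    if len(accepted) != len(rejected):
        raise ValueError("accepted and rejected must have the same length")
    if not accepted:
        raise ValueError("at least one criterion is required")
    per_criterion = [
        max(0.0, margin - (a - r)) for a, r in zip(accepted, rejected)
    ]
    # Average over criteria (an assumption; other reductions are possible).
    return sum(per_criterion) / len(per_criterion)


def aggregate(sub_scores: Sequence[float]) -> float:
    """Combine sub-scores into one overall score (assumed here to be a sum)."""
    return float(sum(sub_scores))


# Two hypothetical criteria: the accepted report clearly wins on the first
# and ties with the rejected report on the second.
accepted_scores = [2.0, 0.5]
rejected_scores = [0.0, 0.5]
loss = margin_reward_loss(accepted_scores, rejected_scores, margin=1.0)
overall = aggregate(accepted_scores)
```

In this sketch only the tied criterion contributes to the loss, which shows how a per-criterion margin can point to the specific aspect on which a scorer fails to separate the two reports.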
Published
2026-03-14
How to Cite
Liu, Y., Li, Y., Wang, Z., Liang, X., Liu, L., Wang, L., & Zhou, L. (2026). ReFINE: A Reward-Based Framework for Interpretable and Nuanced Evaluation of Radiology Report Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(9), 7413–7421. https://doi.org/10.1609/aaai.v40i9.37680
Section
AAAI Technical Track on Computer Vision VI