GEMA-Score: Granular Explainable Multi-Agent Scoring Framework for Radiology Report Evaluation

Authors

  • Zhenxuan Zhang Department of Bioengineering, Imperial College London, UK
  • KinHei Lee Department of Bioengineering, Imperial College London, UK
  • Peiyuan Jing Department of Bioengineering, Imperial College London, UK
  • Weihang Deng Department of Bioengineering, Imperial College London, UK
  • Huichi Zhou Department of Bioengineering, Imperial College London, UK
  • Zihao Jin Department of Bioengineering, Imperial College London, UK
  • Jiahao Huang Department of Bioengineering, Imperial College London, UK
  • Zhifan Gao School of Biomedical Engineering, Sun Yat-sen University, China
  • Dominic C. Marshall Department of Surgery & Cancer, Imperial College London, UK
  • Yingying Fang Department of Bioengineering, Imperial College London, UK
  • Guang Yang Department of Bioengineering, Imperial College London, UK

DOI:

https://doi.org/10.1609/aaai.v40i15.38302

Abstract

Automatic medical report generation has the potential to support clinical diagnosis, reduce radiologists' workload, and improve diagnostic consistency. However, current evaluation metrics often fail to reflect the clinical reliability of generated reports. Overlap-based methods overlook fine-grained details (e.g., location, severity); diagnostic metrics are limited by fixed vocabularies or templates, reducing their ability to capture diverse clinical expressions; and LLM-based metrics lack interpretable reasoning, limiting trust in clinical settings. We therefore propose the Granular Explainable Multi-Agent Score (GEMA-Score), which performs both objective quantification and subjective evaluation through a large language model-based multi-agent workflow. GEMA-Score parses structured reports and computes stable scores through interactive exchanges of information among agents to assess disease diagnosis, location, severity, and uncertainty. Additionally, an LLM-based scoring agent evaluates completeness, readability, and clinical terminology while providing explanatory feedback. Extensive experiments show that GEMA-Score achieves the highest correlation with human expert annotations on public datasets (Kendall correlation = 0.69 on ReXVal; 0.45 on RadEvalX), demonstrating improved reliability for clinical scoring.

Published

2026-03-14

How to Cite

Zhang, Z., Lee, K., Jing, P., Deng, W., Zhou, H., Jin, Z., … Yang, G. (2026). GEMA-Score: Granular Explainable Multi-Agent Scoring Framework for Radiology Report Evaluation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(15), 13025–13033. https://doi.org/10.1609/aaai.v40i15.38302

Issue

Section

AAAI Technical Track on Computer Vision XII