GEMA-Score: Granular Explainable Multi-Agent Scoring Framework for Radiology Report Evaluation

Authors

  • Zhenxuan Zhang Department of Bioengineering, Imperial College London, UK
  • KinHei Lee Department of Bioengineering, Imperial College London, UK
  • Peiyuan Jing Department of Bioengineering, Imperial College London, UK
  • Weihang Deng Department of Bioengineering, Imperial College London, UK
  • Huichi Zhou Department of Bioengineering, Imperial College London, UK
  • Zihao Jin Department of Bioengineering, Imperial College London, UK
  • Jiahao Huang Department of Bioengineering, Imperial College London, UK
  • Zhifan Gao School of Biomedical Engineering, Sun Yat-sen University, China
  • Dominic C. Marshall Department of Surgery & Cancer, Imperial College London, UK
  • Yingying Fang Department of Bioengineering, Imperial College London, UK
  • Guang Yang Department of Bioengineering, Imperial College London, UK

DOI:

https://doi.org/10.1609/aaai.v40i15.38302

Abstract

Automatic medical report generation has the potential to support clinical diagnosis, reduce radiologists' workload, and improve diagnostic consistency. However, current evaluation metrics often fail to reflect the clinical reliability of generated reports. Overlap-based methods overlook fine-grained details (e.g., location, severity); diagnostic metrics are limited by fixed vocabularies or templates, reducing their ability to capture diverse clinical expressions; and LLM-based metrics lack interpretable reasoning, limiting trust in clinical settings. We therefore propose the Granular Explainable Multi-Agent Score (GEMA-Score), which performs both objective quantification and subjective evaluation through a large language model-based multi-agent workflow. GEMA-Score parses structured reports and computes stable scores through interactive exchanges of information among agents to assess disease diagnosis, location, severity, and uncertainty. Additionally, an LLM-based scoring agent evaluates completeness, readability, and clinical terminology while providing explanatory feedback. Extensive experiments show that GEMA-Score achieves the highest correlation with human expert annotations on public datasets (Kendall correlation = 0.69 on ReXVal; 0.45 on RadEvalX), demonstrating improved reliability for clinical scoring.

Published

2026-03-14

How to Cite

Zhang, Z., Lee, K., Jing, P., Deng, W., Zhou, H., Jin, Z., … Yang, G. (2026). GEMA-Score: Granular Explainable Multi-Agent Scoring Framework for Radiology Report Evaluation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(15), 13025–13033. https://doi.org/10.1609/aaai.v40i15.38302

Issue

Section

AAAI Technical Track on Computer Vision XII