AutoSCORE: Enhancing Automated Scoring with Multi-Agent Large Language Models via Structured Component Recognition

Yun Wang; Zhaojun Ding; Xuansheng Wu; Siyue Sun; Ninghao Liu; Xiaoming Zhai

doi:10.1609/aaai.v40i48.42123

Authors

Yun Wang, School of Computing, University of Georgia
Zhaojun Ding, School of Computing, University of Georgia
Xuansheng Wu, School of Computing, University of Georgia
Siyue Sun, Khoury College of Computer Sciences, Northeastern University
Ninghao Liu, School of Computing, University of Georgia
Xiaoming Zhai, AI4STEM Education Center, University of Georgia

DOI: https://doi.org/10.1609/aaai.v40i48.42123

Abstract

Automated scoring plays a crucial role in education by reducing reliance on human raters and offering scalable, immediate evaluation of student work. While large language models (LLMs) have shown strong potential in this task, their use as end-to-end raters faces challenges such as low accuracy, prompt sensitivity, limited interpretability, and rubric misalignment, which hinder practical implementation. To address these limitations, we propose AutoSCORE, a multi-agent LLM framework that enhances automated scoring via rubric-aligned Structured COmponent REcognition. AutoSCORE uses two agents: the Scoring Rubric Component Extraction Agent first extracts rubric-relevant components from a student response and encodes them into a structured representation, which the Scoring Agent then uses to assign the final score. This design ensures that model reasoning follows a human-like grading process, enhancing interpretability and robustness. We evaluate AutoSCORE on four datasets from the ASAP benchmark, using both proprietary and open-source LLMs (GPT-4o, LLaMA-3.1-8B, LLaMA-3.1-70B). Across diverse tasks and rubrics, AutoSCORE predominantly improves scoring accuracy and human-machine agreement (QWK, correlations) and reduces error metrics (MAE, RMSE) compared to single-agent baselines, with particularly strong benefits on complex, multidimensional rubrics and especially large relative gains on smaller LLMs. These results demonstrate that structured component recognition combined with a multi-agent design offers a scalable, reliable, and interpretable solution for automated scoring.
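The two-stage flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify prompts or the format of the structured representation, so the JSON map of booleans, the prompt wording, and every name here (`LLM`, `extract_components`, `assign_score`, `autoscore_sketch`) are assumptions made for illustration. The model is passed in as a plain callable so the sketch stays independent of any particular LLM client.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interface: any function mapping a prompt string to a reply string.
LLM = Callable[[str], str]


@dataclass
class ScoringResult:
    components: Dict[str, bool]  # rubric component -> present in the response?
    score: int


def extract_components(llm: LLM, rubric: List[str], response: str) -> Dict[str, bool]:
    """Stage 1: ask the model which rubric components appear in the response."""
    prompt = (
        "For each rubric component, decide whether it is present in the student "
        "response. Reply with a JSON object mapping component to true or false.\n"
        f"Components: {json.dumps(rubric)}\n"
        f"Response: {response}"
    )
    raw = json.loads(llm(prompt))
    # Keep only rubric keys; a component the model omits counts as absent.
    return {name: bool(raw.get(name, False)) for name in rubric}


def assign_score(llm: LLM, components: Dict[str, bool], max_score: int) -> int:
    """Stage 2: score from the structured representation, not the raw text."""
    prompt = (
        f"Given these recognized rubric components, reply with one integer score "
        f"from 0 to {max_score}.\n"
        f"Components: {json.dumps(components)}"
    )
    score = int(llm(prompt).strip())
    # Clamp so an out-of-range reply cannot leave the rubric's scale.
    return max(0, min(max_score, score))


def autoscore_sketch(llm: LLM, rubric: List[str], response: str, max_score: int) -> ScoringResult:
    """Run extraction, then scoring, and keep the intermediate components."""
    components = extract_components(llm, rubric, response)
    return ScoringResult(components, assign_score(llm, components, max_score))


if __name__ == "__main__":
    # Deterministic stand-in for a real model, used only to exercise the flow.
    def stub_llm(prompt: str) -> str:
        if prompt.startswith("For each rubric component"):
            return '{"claim": true, "evidence": false}'
        return "1"

    print(autoscore_sketch(stub_llm, ["claim", "evidence"], "Plants need light.", 2))
```

Returning the intermediate `components` alongside the score is what makes such a pipeline inspectable: a reviewer can see which rubric elements were recognized before checking whether the score follows from them.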
