EssayBench: Evaluating Large Language Models in Multi-Genre Chinese Essay Writing

Authors

  • Fan Gao, Huawei Technologies Ltd.; The University of Tokyo
  • Dongyuan Li, The University of Tokyo
  • Ding Xia, The University of Tokyo
  • Fei Mi, Huawei Technologies Ltd.
  • Yasheng Wang, Huawei Technologies Ltd.
  • Lifeng Shang, Huawei Technologies Ltd.
  • Baojun Wang, Huawei Technologies Ltd.

DOI:

https://doi.org/10.1609/aaai.v40i44.41072

Abstract

Prompt-based essay writing is an effective and common way to assess students' critical thinking skills. Recent work has evaluated the impressive capabilities of Large Language Models (LLMs) on this task. However, most studies focus primarily on English. Those examining LLMs' performance in Chinese often rely on coarse-grained text quality metrics, overlooking the structural and rhetorical complexities of Chinese essays, particularly across diverse genres. We therefore propose EssayBench, a multi-genre benchmark specifically designed for Chinese essay writing, along with a fine-grained, genre-specific scoring framework that hierarchically aggregates scores to better align with human preferences. The dataset comprises 728 real-world prompts across four major genres (Argumentative, Narrative, Descriptive, and Expository), and includes both Open-Ended and Constrained types. Our evaluation protocol is validated through a comprehensive human agreement study. The results show that our protocol aligns well with human judgments, reaching a maximum Spearman's correlation of 0.816 and outperforming coarse-grained evaluation methods by an average of 8.6%. Finally, we benchmark 15 large LLMs, analyzing their strengths and limitations across genres and instruction types. We believe EssayBench offers a more reliable framework for evaluating Chinese essay generation and provides valuable insights for improving LLMs in this domain.

Published

2026-03-14

How to Cite

Gao, F., Li, D., Xia, D., Mi, F., Wang, Y., Shang, L., & Wang, B. (2026). EssayBench: Evaluating Large Language Models in Multi-Genre Chinese Essay Writing. Proceedings of the AAAI Conference on Artificial Intelligence, 40(44), 37396–37406. https://doi.org/10.1609/aaai.v40i44.41072

Section

AAAI Special Track on AI Alignment