CharacterBench: Benchmarking Character Customization of Large Language Models

Authors

  • Jinfeng Zhou, The CoAI Group, DCST, Tsinghua University
  • Yongkang Huang, Lingxin AI; Northwest Minzu University
  • Bosi Wen, The CoAI Group, DCST, Tsinghua University
  • Guanqun Bi, The CoAI Group, DCST, Tsinghua University
  • Yuxuan Chen, The CoAI Group, DCST, Tsinghua University
  • Pei Ke, The CoAI Group, DCST, Tsinghua University
  • Zhuang Chen, The CoAI Group, DCST, Tsinghua University
  • Xiyao Xiao, Lingxin AI; Beijing Normal University
  • Libiao Peng, Lingxin AI; Tsinghua University
  • Kuntian Tang, Lingxin AI; Guangdong University of Finance & Economics
  • Rongsheng Zhang, Fuxi AI Lab, Netease
  • Le Zhang, Fuxi AI Lab, Netease
  • Tangjie Lv, Fuxi AI Lab, Netease
  • Zhipeng Hu, Fuxi AI Lab, Netease
  • Hongning Wang, The CoAI Group, DCST, Tsinghua University
  • Minlie Huang, The CoAI Group, DCST, Tsinghua University

DOI:

https://doi.org/10.1609/aaai.v39i24.34806

Abstract

Character-based dialogue (aka role-playing) enables users to freely customize characters for interaction, which often relies on LLMs, raising the need to evaluate LLMs’ character customization capability. However, existing benchmarks fail to ensure a robust evaluation as they often only involve a single character category or evaluate limited dimensions. Moreover, the sparsity of character features in responses makes feature-focused generative evaluation both ineffective and inefficient. To address these issues, we propose CharacterBench, the largest bilingual generative benchmark, with 22,859 human-annotated samples covering 3,956 characters from 25 detailed character categories. We define 11 dimensions across 6 aspects, classified as sparse or dense dimensions according to whether the character features evaluated by a given dimension manifest in every response. We enable effective and efficient evaluation by crafting tailored queries for each dimension to elicit character responses relevant to that dimension. Further, we develop the CharacterJudge model for cost-effective and stable evaluation. Experiments show its superiority over SOTA automatic judges (e.g., GPT-4) and our benchmark’s potential to optimize LLMs’ character customization.

Published

2025-04-11

How to Cite

Zhou, J., Huang, Y., Wen, B., Bi, G., Chen, Y., Ke, P., Chen, Z., Xiao, X., Peng, L., Tang, K., Zhang, R., Zhang, L., Lv, T., Hu, Z., Wang, H., & Huang, M. (2025). CharacterBench: Benchmarking Character Customization of Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 39(24), 26101-26110. https://doi.org/10.1609/aaai.v39i24.34806

Section

AAAI Technical Track on Natural Language Processing III