GeWu: A Culturally-Grounded Chinese Benchmark for Multi-Stage Social Bias Evaluation in Large Language Models
DOI:
https://doi.org/10.1609/aaai.v40i38.40474
Abstract
Despite the rapid deployment of Chinese large language models (LLMs), culturally grounded bias evaluation remains understudied due to the dominance of English benchmarks and the simplicity of existing Chinese scenarios. To address this, we propose GeWu, a comprehensive benchmark featuring a culturally aware dataset of 60,192 questions spanning 14 social groups with fine-grained Chinese contexts, significantly exceeding existing resources in breadth and depth. Our two-stage evaluation first quantifies bias via multiple-choice questions using a novel probability-based scoring mechanism that sensitively captures bias tendencies, distilling high-bias scenarios into GeWu-1K. This refined subset then enables multi-turn dialogue evaluations for in-depth analysis under realistic conditions. Experiments show that GeWu effectively exposes social biases in state-of-the-art Chinese LLMs, with 13.93% of scenarios eliciting universal bias across all models. These findings highlight persistent challenges and provide actionable insights for bias mitigation in Chinese contexts.
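To illustrate the general idea behind probability-based scoring of multiple-choice answers, the sketch below converts per-option log-probabilities into the probability mass a model places on the stereotype-consistent option. This is a hypothetical illustration of the technique, not the paper's actual scoring formula; the option labels and inputs are invented for the example.

```python
import math

def bias_score(option_logprobs, biased_option):
    """Illustrative bias score: the softmax-normalized probability
    mass assigned to the stereotype-consistent answer option.

    option_logprobs: dict mapping option label -> log-probability
    biased_option: label of the stereotype-consistent option
    (Hypothetical helper; not GeWu's published scoring mechanism.)
    """
    # Softmax over the candidate options, numerically stabilized
    # by subtracting the maximum log-probability.
    m = max(option_logprobs.values())
    exp = {k: math.exp(v - m) for k, v in option_logprobs.items()}
    z = sum(exp.values())
    return exp[biased_option] / z

# Example: options A (stereotype-consistent), B (counter-stereotypical),
# C ("cannot be determined"), with made-up log-probabilities.
score = bias_score({"A": -0.5, "B": -2.0, "C": -1.5}, "A")
print(round(score, 3))  # -> 0.629
```

A score near 1/3 for a three-option question would indicate no preference among the options, while values well above it suggest a bias tendency worth flagging for the dialogue-stage analysis.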
Published
2026-03-14
How to Cite
Lin, Y., Zhou, Z., Gao, J., Guo, X., Zhang, J., Wu, H., … Wei, X. (2026). GeWu: A Culturally-Grounded Chinese Benchmark for Multi-Stage Social Bias Evaluation in Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(38), 32033–32041. https://doi.org/10.1609/aaai.v40i38.40474
Section
AAAI Technical Track on Natural Language Processing III