Does Question Really Matter? The Attribution of Answer Bias in LLM Evaluation

Authors

  • Boxi Cao Institute of Software, Chinese Academy of Sciences
  • Ruotong Pan Institute of Software, Chinese Academy of Sciences
  • Hongyu Lin Institute of Software, Chinese Academy of Sciences
  • Xianpei Han Institute of Software, Chinese Academy of Sciences
  • Le Sun Institute of Software, Chinese Academy of Sciences

DOI:

https://doi.org/10.1609/aaai.v40i36.40262

Abstract

Multiple-choice question answering (MCQA) has emerged as one of the most popular task formats for large language model (LLM) evaluation. Unfortunately, there is substantial evidence that current MCQA benchmarks suffer from significant answer bias, which severely undermines the reliability of evaluation conclusions. Specifically, many LLMs achieve accuracy significantly above random selection even when the questions are omitted from the input. To address this, we conduct a systematic investigation into the attribution of answer bias and demonstrate a strong correlation between the degree of data contamination and the severity of answer bias, whereas the position of options and the popularity of answers have relatively minor effects. Building on these insights, we further propose OPD, a straightforward yet effective tool for contamination detection and dataset debiasing that does not require access to the model’s training data. Our findings and algorithms provide valuable insights for the design of future trustworthy LLM evaluation protocols.
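
The question-omission check described above can be illustrated with a minimal sketch: score each option with a causal language model while the question is withheld, then compare accuracy against the 1/K random baseline. This is an illustrative probe under assumed names (the option_only_loglik helper and the gpt2 checkpoint are placeholders), not the authors' OPD tool.

```python
# Sketch of a question-omission probe for answer bias (NOT the paper's OPD tool).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder checkpoint; swap in the LLM under test
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def option_only_loglik(options_block: str, answer_text: str) -> float:
    """Log-likelihood of `answer_text` given ONLY the options (question omitted)."""
    prompt_ids = tokenizer(options_block, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer_text, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Shift so position t predicts token t+1, then sum log-probs of the answer tokens.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    n = answer_ids.shape[1]
    answer_lp = logprobs[-n:].gather(1, targets[-n:].unsqueeze(1))
    return answer_lp.sum().item()

# Toy item: only the options are shown; the question is deliberately omitted.
letters = ["A", "B", "C", "D"]
options = ["Paris", "London", "Berlin", "Madrid"]
options_block = "\n".join(f"{l}. {o}" for l, o in zip(letters, options)) + "\nAnswer:"
scores = [option_only_loglik(options_block, f" {l}") for l in letters]
print(letters[scores.index(max(scores))])
```

Aggregated over a benchmark, accuracy well above the 1/K random baseline (0.25 in this four-option example) is the answer-bias signal the abstract describes; how that signal relates to data contamination is the subject of the paper itself.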

Published

2026-03-14

How to Cite

Cao, B., Pan, R., Lin, H., Han, X., & Sun, L. (2026). Does Question Really Matter? The Attribution of Answer Bias in LLM Evaluation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(36), 30130-30138. https://doi.org/10.1609/aaai.v40i36.40262

Section

AAAI Technical Track on Natural Language Processing I