Does Question Really Matter? The Attribution of Answer Bias in LLM Evaluation

Authors

  • Boxi Cao Institute of Software, Chinese Academy of Sciences
  • Ruotong Pan Institute of Software, Chinese Academy of Sciences
  • Hongyu Lin Institute of Software, Chinese Academy of Sciences
  • Xianpei Han Institute of Software, Chinese Academy of Sciences
  • Le Sun Institute of Software, Chinese Academy of Sciences

DOI:

https://doi.org/10.1609/aaai.v40i36.40262

Abstract

Multiple-choice question answering (MCQA) has emerged as one of the most popular task formats for large language model (LLM) evaluation. Unfortunately, there is substantial evidence that current MCQA benchmarks suffer from significant answer bias, which severely undermines the reliability of evaluation conclusions. Specifically, many LLMs achieve accuracy significantly above random selection even when the questions are omitted from the input. To address this, we conduct a systematic investigation into the attribution of answer bias and demonstrate a strong correlation between the degree of data contamination and the severity of answer bias, whereas the position of options and the popularity of answers have relatively minor effects. Building on these insights, we further propose OPD, a straightforward yet effective tool for contamination detection and dataset debiasing that does not require access to the model’s training data. Our findings and algorithms provide valuable insights for the design of future trustworthy LLM evaluation protocols.
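
The question-omission check described above can be illustrated with a minimal sketch: score each option with a causal language model while the question is withheld, then compare accuracy against the 1/K random baseline. This is an illustrative probe under assumed names (the option_only_loglik helper and the gpt2 checkpoint are placeholders), not the authors' OPD tool.

```python
# Sketch of a question-omission probe for answer bias (NOT the paper's OPD tool).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder checkpoint; swap in the LLM under test
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def option_only_loglik(options_block: str, answer_text: str) -> float:
    """Log-likelihood of `answer_text` given ONLY the options (question omitted)."""
    prompt_ids = tokenizer(options_block, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer_text, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Shift so position t predicts token t+1, then sum log-probs of the answer tokens.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    n = answer_ids.shape[1]
    answer_lp = logprobs[-n:].gather(1, targets[-n:].unsqueeze(1))
    return answer_lp.sum().item()

# Toy item: only the options are shown; the question is deliberately omitted.
letters = ["A", "B", "C", "D"]
options = ["Paris", "London", "Berlin", "Madrid"]
options_block = "\n".join(f"{l}. {o}" for l, o in zip(letters, options)) + "\nAnswer:"
scores = [option_only_loglik(options_block, f" {l}") for l in letters]
print(letters[scores.index(max(scores))])
```

Aggregated over a benchmark, accuracy well above the 1/K random baseline (0.25 in this four-option example) is the answer-bias signal the abstract describes; how that signal relates to data contamination is the subject of the paper itself.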

Published

2026-03-14

How to Cite

Cao, B., Pan, R., Lin, H., Han, X., & Sun, L. (2026). Does Question Really Matter? The Attribution of Answer Bias in LLM Evaluation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(36), 30130-30138. https://doi.org/10.1609/aaai.v40i36.40262

Section

AAAI Technical Track on Natural Language Processing I