Enhancing Fairness in LLM Evaluations: Unveiling and Mitigating Biases in Standard-Answer-Based Evaluations

Authors

  • Tong Jiao, Carnegie Mellon University
  • Jian Zhang, China Mobile
  • Kui Xu, China Mobile
  • Rui Li, China Mobile
  • Xi Du, China Mobile
  • Shangqi Wang, China Mobile
  • Zhenbo Song, Nanjing University of Science and Technology

DOI

https://doi.org/10.1609/aaaiss.v4i1.31771

Abstract

Large Language Models (LLMs) are recognized as effective judges when comparing two answers against each other. However, LLMs can still exhibit biases when comparing a single answer to a standard answer, a setting common in real-world scenarios such as new employee orientations. This paper identifies positional and verbosity biases in LLM evaluators in such contexts. To mitigate these biases, we apply Chain-of-Thought prompting and Multi-Agent Debate strategies. Our research reveals that the prevalence of these biases varies across models, indicating the need for tailored approaches to ensure unbiased and constructive feedback.
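
The full text is not reproduced on this page, but the evaluation setup the abstract describes can be sketched. The snippet below is a minimal, hypothetical illustration rather than the authors' implementation: it asks an LLM judge to grade a candidate answer against a standard answer with a Chain-of-Thought prompt that explicitly discounts verbosity, and it runs the comparison in both presentation orders to surface positional bias. The prompt wording, the 1-10 scale, and `query_llm` are all assumed stand-ins, not details taken from the paper.

```python
# Minimal sketch (not the authors' code) of standard-answer-based LLM evaluation
# using Chain-of-Thought prompting plus an order-swap check for positional bias.
# `query_llm` is a hypothetical placeholder for any chat-completion API.

COT_JUDGE_PROMPT = """You are grading a trainee's answer against a standard answer.
First, reason step by step about how well the candidate covers the key points of
the standard answer. Do not reward extra length on its own. Then output a single
line: "Score: X" where X is an integer from 1 to 10.

{first_label}:
{first_text}

{second_label}:
{second_text}
"""

def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., a chat-completion request)."""
    raise NotImplementedError

def parse_score(response: str) -> int:
    """Extract the integer after 'Score:' from the judge's response."""
    for line in response.splitlines():
        if line.strip().lower().startswith("score:"):
            return int(line.split(":")[1].strip())
    raise ValueError("No score found in judge response")

def judge_once(standard: str, candidate: str, standard_first: bool) -> int:
    """Run one CoT-prompted comparison with a chosen presentation order."""
    if standard_first:
        prompt = COT_JUDGE_PROMPT.format(
            first_label="Standard answer", first_text=standard,
            second_label="Candidate answer", second_text=candidate)
    else:
        prompt = COT_JUDGE_PROMPT.format(
            first_label="Candidate answer", first_text=candidate,
            second_label="Standard answer", second_text=standard)
    return parse_score(query_llm(prompt))

def judge_with_position_check(standard: str, candidate: str) -> dict:
    """Score in both orders; a large gap between runs signals positional bias."""
    score_std_first = judge_once(standard, candidate, standard_first=True)
    score_cand_first = judge_once(standard, candidate, standard_first=False)
    return {
        "score": (score_std_first + score_cand_first) / 2,
        "positional_gap": abs(score_std_first - score_cand_first),
    }
```

A Multi-Agent Debate variant would follow the same pattern but feed each judge's reasoning to one or more other judge instances before the final score is produced; the paper evaluates both strategies across models.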

Published

2024-11-08

How to Cite

Jiao, T., Zhang, J., Xu, K., Li, R., Du, X., Wang, S., & Song, Z. (2024). Enhancing Fairness in LLM Evaluations: Unveiling and Mitigating Biases in Standard-Answer-Based Evaluations. Proceedings of the AAAI Symposium Series, 4(1), 56-59. https://doi.org/10.1609/aaaiss.v4i1.31771

Section

AI Trustworthiness and Risk Assessment for Challenging Contexts (ATRACC)