Evaluating the Evaluator: Measuring LLMs’ Adherence to Task Evaluation Instructions
DOI:
https://doi.org/10.1609/aaai.v39i18.34157
Abstract
LLM-as-a-judge is a recently popularized method that replaces human judgements in task evaluation with automatic evaluation by LLMs. Owing to the widespread use of RLHF (Reinforcement Learning from Human Feedback), state-of-the-art LLMs like GPT4 and Llama3 are expected to align strongly with human preferences when prompted for a quality judgement, such as the coherence of a text. While this seems beneficial, it is not clear whether an LLM-as-a-judge's assessment is an evaluation based only on the instructions in the prompt, or whether it reflects the model's preference for high-quality data similar to its fine-tuning data. To investigate how much prompting influences the alignment of AI judgements with human judgements, we analyze prompts with increasing levels of instruction about the target quality of an evaluation, across several LLMs-as-a-judge. We further compare against a prompt-free method that uses model perplexity as the quality measure. We aggregate a taxonomy of quality criteria commonly used across state-of-the-art evaluations with LLMs and provide it as a rigorous benchmark of models as judges. Overall, we show that LLMs-as-a-judge benefit only slightly from highly detailed instructions in prompts, and that perplexity can sometimes align better with human judgements than prompting, especially on textual quality.
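As an illustration of the prompt-free baseline mentioned in the abstract, the sketch below computes perplexity from per-token log-probabilities using the standard definition (the exponential of the negative mean log-probability). The abstract does not describe the exact procedure used in the paper, so this is only a sketch under stated assumptions: the function names, the hard-coded log-probabilities, and the convention that lower perplexity is read as higher textual quality are all hypothetical choices made for this example.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_by_perplexity(scored_texts):
    """Sort (text_id, token_logprobs) pairs from lowest to highest perplexity.

    Assumption for this sketch: lower perplexity is treated as higher quality.
    """
    return sorted(scored_texts, key=lambda item: perplexity(item[1]))


# Hypothetical log-probabilities; in practice a language model would supply them.
candidates = [
    ("fluent", [math.log(0.5)] * 4),
    ("disfluent", [math.log(0.1)] * 4),
]
ranking = [text_id for text_id, _ in rank_by_perplexity(candidates)]
```

A ranking like this could then be compared with human ratings of the same texts, for example through a rank correlation, to check how well the prompt-free score tracks human judgement.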
Published
2025-04-11
How to Cite
Murugadoss, B., Poelitz, C., Drosos, I., Le, V., McKenna, N., Negreanu, C. S., … Sarkar, A. (2025). Evaluating the Evaluator: Measuring LLMs’ Adherence to Task Evaluation Instructions. Proceedings of the AAAI Conference on Artificial Intelligence, 39(18), 19589–19597. https://doi.org/10.1609/aaai.v39i18.34157
Section
AAAI Technical Track on Machine Learning IV