Maximizing Signal in Human-Model Preference Alignment

Authors

  • Kelsey Kraus, Cisco Systems
  • Margaret Kroll, Cisco Systems

DOI:

https://doi.org/10.1609/aaai.v39i26.34950

Abstract

The emergence of powerful LLMs has led to a paradigm shift in Natural Language Understanding and Natural Language Generation. But the properties that make LLMs so valuable for these tasks -- creativity, the ability to produce fluent speech, and the ability to quickly and effectively abstract information from large corpora -- also present new challenges to evaluating their outputs. The rush to market has led teams to fall back on quick, cost-effective automatic evaluations, which offer value but do not obviate the need for human judgments in model training and evaluation. We argue that when end users need to agree with the decisions made by ML models -- e.g., in toxicity detection or in extraction of main points for summarization -- models should be trained and tested on data that represent the preferences of those users. This paper primarily discusses the role of human feedback in labeling and judgment tasks for model training and evaluation. We first propose methods for disentangling noise from signal in labeling tasks. We show that noise in labeling disagreement can be minimized by adhering to proven methodological best practices, while signal in labeling disagreement can be maximized to play an integral role in model training and evaluation tasks. We illustrate best practices by providing a case study in which two guardrail classifiers are evaluated, using human judgments to align final model behavior to user preferences. We aim for this paper to provide researchers and professionals with guidelines for integrating human judgments into their ML and generative AI evaluation toolkit when working toward accurate and unbiased features that align with users' needs and expectations.

Published

2025-04-11

How to Cite

Kraus, K., & Kroll, M. (2025). Maximizing Signal in Human-Model Preference Alignment. Proceedings of the AAAI Conference on Artificial Intelligence, 39(26), 27392-27400. https://doi.org/10.1609/aaai.v39i26.34950

Section

AAAI Technical Track on AI Alignment