Can You Rely on Synthetic Labellers in Preference-Based Reinforcement Learning? It’s Complicated

Authors

  • Katherine Metcalf, Apple
  • Miguel Sarabia, Apple
  • Masha Fedzechkina, Apple
  • Barry-John Theobald, Apple

DOI:

https://doi.org/10.1609/aaai.v38i9.28877

Keywords:

HAI: Learning Human Values and Preferences, HAI: Human-in-the-loop Machine Learning, ML: Reinforcement Learning, ML: Representation Learning

Abstract

Preference-based Reinforcement Learning (PbRL) enables non-experts to train Reinforcement Learning models using preference feedback. However, the effort required to collect preference labels from real humans means that PbRL research primarily relies on synthetic labellers. We validate the most common synthetic labelling strategy by comparing it against labels collected from a crowd of humans on three DeepMind Control (DMC) Suite tasks: stand, walk, and run. We find that: (1) the synthetic labels are a good proxy for real humans under some circumstances, (2) strong preference label agreement between human and synthetic labels is not necessary for similar policy performance, (3) policies trained from human feedback perform better early in training, whereas policies trained from synthetic feedback perform better by the end of training, and (4) training on only examples with high levels of inter-annotator agreement does not meaningfully improve policy performance. Our results justify the use of synthetic labellers to develop and ablate PbRL methods, and provide insight into how human labelling changes over the course of policy training.
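
For readers unfamiliar with the setup: the "most common synthetic labelling strategy" referred to above is typically a scripted teacher that, given a pair of trajectory segments, prefers whichever segment has the higher ground-truth environment return. The sketch below is illustrative only and is not the authors' implementation; the function name, the optional Bradley-Terry rationality parameter beta, and the NumPy-based interface are assumptions made for the example.

    import numpy as np

    def synthetic_preference(segment_a_rewards, segment_b_rewards, beta=None, rng=None):
        """Label a segment pair the way a common scripted teacher does:
        prefer the segment with the higher ground-truth return.

        segment_*_rewards: 1-D arrays of per-step environment rewards.
        beta: if given, sample the label from a Bradley-Terry model with
              rationality beta instead of labelling deterministically.
        Returns 0 if segment A is preferred, 1 if segment B is preferred.
        """
        ret_a = float(np.sum(segment_a_rewards))
        ret_b = float(np.sum(segment_b_rewards))
        if beta is None:
            # Deterministic oracle: always pick the higher-return segment.
            return 0 if ret_a >= ret_b else 1
        # Stochastic (Boltzmann-rational) variant: noisier for close returns.
        rng = rng or np.random.default_rng()
        p_a = 1.0 / (1.0 + np.exp(-beta * (ret_a - ret_b)))
        return 0 if rng.random() < p_a else 1

Human crowd labels, by contrast, need not track the ground-truth return at all, which is the gap the paper measures.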

Published

2024-03-24

How to Cite

Metcalf, K., Sarabia, M., Fedzechkina, M., & Theobald, B.-J. (2024). Can You Rely on Synthetic Labellers in Preference-Based Reinforcement Learning? It’s Complicated. Proceedings of the AAAI Conference on Artificial Intelligence, 38(9), 10128-10136. https://doi.org/10.1609/aaai.v38i9.28877

Issue

Vol. 38 No. 9 (2024)
Section

AAAI Technical Track on Humans and AI