Can You Rely on Synthetic Labellers in Preference-Based Reinforcement Learning? It’s Complicated

Authors

  • Katherine Metcalf, Apple
  • Miguel Sarabia, Apple
  • Masha Fedzechkina, Apple
  • Barry-John Theobald, Apple

DOI:

https://doi.org/10.1609/aaai.v38i9.28877

Keywords:

HAI: Learning Human Values and Preferences, HAI: Human-in-the-loop Machine Learning, ML: Reinforcement Learning, ML: Representation Learning

Abstract

Preference-based Reinforcement Learning (PbRL) enables non-experts to train Reinforcement Learning models using preference feedback. However, the effort required to collect preference labels from real humans means that PbRL research primarily relies on synthetic labellers. We validate the most common synthetic labelling strategy by comparing it against labels collected from a crowd of humans on three DeepMind Control (DMC) Suite tasks: stand, walk, and run. We find that: (1) the synthetic labels are a good proxy for real humans under some circumstances, (2) strong preference label agreement between human and synthetic labels is not necessary for similar policy performance, (3) policies trained from human feedback perform better early in training, whereas policies trained from synthetic feedback perform better by the end of training, and (4) training on only examples with high levels of inter-annotator agreement does not meaningfully improve policy performance. Our results justify the use of synthetic labellers to develop and ablate PbRL methods, and provide insight into how human labelling changes over the course of policy training.
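
For readers unfamiliar with the setup: the "most common synthetic labelling strategy" referred to above is typically a scripted teacher that, given a pair of trajectory segments, prefers whichever segment has the higher ground-truth environment return. The sketch below is illustrative only and is not the authors' implementation; the function name, the optional Bradley-Terry rationality parameter beta, and the NumPy-based interface are assumptions made for the example.

    import numpy as np

    def synthetic_preference(segment_a_rewards, segment_b_rewards, beta=None, rng=None):
        """Label a segment pair the way a common scripted teacher does:
        prefer the segment with the higher ground-truth return.

        segment_*_rewards: 1-D arrays of per-step environment rewards.
        beta: if given, sample the label from a Bradley-Terry model with
              rationality beta instead of labelling deterministically.
        Returns 0 if segment A is preferred, 1 if segment B is preferred.
        """
        ret_a = float(np.sum(segment_a_rewards))
        ret_b = float(np.sum(segment_b_rewards))
        if beta is None:
            # Deterministic oracle: always pick the higher-return segment.
            return 0 if ret_a >= ret_b else 1
        # Stochastic (Boltzmann-rational) variant: noisier for close returns.
        rng = rng or np.random.default_rng()
        p_a = 1.0 / (1.0 + np.exp(-beta * (ret_a - ret_b)))
        return 0 if rng.random() < p_a else 1

Human crowd labels, by contrast, need not track the ground-truth return at all, which is the gap the paper measures.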

Published

2024-03-24

How to Cite

Metcalf, K., Sarabia, M., Fedzechkina, M., & Theobald, B.-J. (2024). Can You Rely on Synthetic Labellers in Preference-Based Reinforcement Learning? It’s Complicated. Proceedings of the AAAI Conference on Artificial Intelligence, 38(9), 10128-10136. https://doi.org/10.1609/aaai.v38i9.28877

Issue

Vol. 38 No. 9 (2024)
Section

AAAI Technical Track on Humans and AI