The Perils of Trial-and-Error Reward Design: Misdesign through Overfitting and Invalid Task Specifications

Authors

  • Serena Booth (Bosch, The University of Texas at Austin, MIT CSAIL)
  • W. Bradley Knox (Bosch, The University of Texas at Austin, Google Research)
  • Julie Shah (MIT CSAIL)
  • Scott Niekum (The University of Texas at Austin, The University of Massachusetts at Amherst)
  • Peter Stone (The University of Texas at Austin, Sony AI)
  • Alessandro Allievi (Bosch, The University of Texas at Austin)

DOI:

https://doi.org/10.1609/aaai.v37i5.25733

Keywords:

HAI: Learning Human Values and Preferences, HAI: Human-in-the-Loop Machine Learning, HAI: Other Foundations of Humans & AI, ML: Auto ML and Hyperparameter Tuning, ML: Reinforcement Learning Algorithms

Abstract

In reinforcement learning (RL), a reward function that aligns exactly with a task's true performance metric is often necessarily sparse. For example, a true task metric might encode a reward of 1 upon success and 0 otherwise. The sparsity of these true task metrics can make them hard to learn from, so in practice they are often replaced with alternative dense reward functions. These dense reward functions are typically designed by experts through an ad hoc process of trial and error. In this process, experts manually search for a reward function that improves performance with respect to the task metric while also enabling an RL algorithm to learn faster. This process raises the question of whether the same reward function is optimal for all algorithms, i.e., whether the reward function can be overfit to a particular algorithm. In this paper, we study the consequences of this widespread yet unexamined practice of trial-and-error reward design. We first conduct computational experiments that confirm that reward functions can be overfit to learning algorithms and their hyperparameters. We then conduct a controlled observation study which emulates expert practitioners' typical experiences of reward design, in which we similarly find evidence of reward function overfitting. We also find that experts' typical approach to reward design, in which they adopt a myopic strategy and weigh the relative goodness of each state-action pair, leads to misdesign through invalid task specifications, since RL algorithms use cumulative reward rather than rewards for individual state-action pairs as an optimization target. Code and data: github.com/serenabooth/reward-design-perils
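
To make the abstract's distinction concrete, the sketch below contrasts a sparse true task metric (1 on success, 0 otherwise) with a hand-designed dense proxy reward in a toy chain-walk task. This is an illustrative sketch only, not the authors' code: the ChainWalk environment, the dense_proxy_reward shaping term, and every other name here are hypothetical assumptions introduced for illustration.

```python
# Minimal sketch: sparse true task metric vs. a hand-designed dense proxy reward.
# All names are illustrative assumptions, not identifiers from the paper's repository.

import random


class ChainWalk:
    """Agent starts at state 0 and must reach state `goal` within `horizon` steps."""

    def __init__(self, goal=10, horizon=30):
        self.goal, self.horizon = goal, horizon

    def reset(self):
        self.state, self.t = 0, 0
        return self.state

    def step(self, action):  # action: -1 (move left) or +1 (move right)
        self.state = max(0, min(self.goal, self.state + action))
        self.t += 1
        done = self.state == self.goal or self.t >= self.horizon
        return self.state, done


def true_task_metric(trajectory, goal=10):
    """Sparse metric aligned with the task: 1 if the goal was reached, else 0."""
    return 1.0 if trajectory[-1] == goal else 0.0


def dense_proxy_reward(state, next_state):
    """Hand-designed dense shaping signal: reward per-step progress toward the goal."""
    return float(next_state - state)


def rollout(policy, env):
    """Run one episode and return the visited states."""
    s, done, states = env.reset(), False, [0]
    while not done:
        s, done = env.step(policy(s))
        states.append(s)
    return states


if __name__ == "__main__":
    env = ChainWalk()
    policies = {"random": lambda s: random.choice([-1, 1]), "greedy": lambda s: 1}
    for name, pi in policies.items():
        traj = rollout(pi, env)
        proxy_return = sum(dense_proxy_reward(a, b) for a, b in zip(traj, traj[1:]))
        print(name, "true metric:", true_task_metric(traj), "proxy return:", proxy_return)
```

Because the dense proxy rewards per-step progress rather than task completion, a policy's proxy return can diverge from the sparse true metric; this kind of hand-tuned proxy is also the sort of signal that, per the paper's findings, can end up tailored to one algorithm or hyperparameter setting rather than to the task itself.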

Published

2023-06-26

How to Cite

Booth, S., Knox, W. B., Shah, J., Niekum, S., Stone, P., & Allievi, A. (2023). The Perils of Trial-and-Error Reward Design: Misdesign through Overfitting and Invalid Task Specifications. Proceedings of the AAAI Conference on Artificial Intelligence, 37(5), 5920-5929. https://doi.org/10.1609/aaai.v37i5.25733

Section

AAAI Technical Track on Humans and AI