An Evolutionary Perspective on AI Alignment (Student Abstract)

Authors

  • Ida Mattsson, Carnegie Mellon University

DOI:

https://doi.org/10.1609/aaai.v39i28.35276

Abstract

Attempting to align AI capabilities and value structures by means of value elicitation from humans, such as through Reinforcement Learning from Human Feedback (RLHF), is a computational challenge that raises both psychological and philosophical questions. Adopting an evolutionary perspective on the emergence of value structures in humans and machine learning systems can offer a bridge between qualitative and quantitative aspects of alignment. Here, evolutionary dynamics are applied to a game-theoretic model of RLHF. This allows for formal reasoning about the process and capabilities that result from alignment training, even where quantitative benchmarks cannot be clearly defined. A simple parametrized game model of RLHF, subject to replicator dynamics, shows how the success of the training method is sensitive to bias in human judgments. Under ideal conditions, RLHF training leads to aligned behavior. If the choice pattern of the human judge is biased, the training instead incentivizes misalignment. This application shows that evolutionary analyses can contribute to improving the prospects for safety and support successful cooperation between humans and AI systems in deployment.
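The kind of analysis the abstract describes can be sketched with standard replicator dynamics. The following toy model is an illustrative assumption, not the paper's actual model: a population mixes "aligned" and "misaligned" policies, and a hypothetical `bias` parameter gives the probability that the human judge rewards misaligned behavior instead of aligned behavior.

```python
# Illustrative sketch (not the paper's model): replicator dynamics for a
# population of "aligned" vs. "misaligned" policies under RLHF-style reward.
# `bias` is a hypothetical parameter: the probability that the human judge
# mistakenly rewards misaligned behavior.

def replicator_share(bias: float, x0: float = 0.5,
                     steps: int = 2000, dt: float = 0.01) -> float:
    """Return the long-run population share of aligned policies."""
    x = x0  # current share of aligned policies
    for _ in range(steps):
        f_aligned = 1.0 - bias          # expected reward to aligned behavior
        f_misaligned = bias             # expected reward to misaligned behavior
        f_mean = x * f_aligned + (1 - x) * f_misaligned
        # Replicator equation: growth proportional to fitness above the mean.
        x += dt * x * (f_aligned - f_mean)
        x = min(max(x, 0.0), 1.0)       # keep the share in [0, 1]
    return x
```

In this sketch, an unbiased judge (`bias` near 0) drives the population toward alignment, while a sufficiently biased judge (`bias` above 0.5) drives it toward misalignment, mirroring the qualitative result the abstract reports.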

Published

2025-04-11

How to Cite

Mattsson, I. (2025). An Evolutionary Perspective on AI Alignment (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 39(28), 29428-29430. https://doi.org/10.1609/aaai.v39i28.35276