When Debate Fails: An Empirical Study of Incentive Misalignment in Prover–Estimator Games

Hannah Guan; Charles Hou

doi:10.1609/aaaiss.v9i1.42904

Authors

Hannah Guan, Harvard University
Charles Hou, Harvard University


Abstract

Debate has been proposed as a mechanism for eliciting truthful reasoning from powerful learning agents by structuring interaction as a game between competing provers evaluated by an estimator. While theoretically appealing, the practical effectiveness of such incentive-based mechanisms remains underexplored. We present the first end-to-end empirical instantiation of this protocol, using reinforcement learning agents based on large language models on verifiable long-context reasoning tasks. Although we carefully align rewards with the theoretical assumptions, the debate mechanism consistently fails to elicit truthful or robust reasoning. Instead, agents converge to degenerate strategies that exploit estimator weaknesses, or collapse into non-informative equilibria. Through controlled experiments and representation analysis using PCA and t-SNE, we identify a fundamental incentive–optimization mismatch: equilibrium incentives do not reliably translate into learnable equilibria under gradient-based optimization. Our results highlight a gap between game-theoretic guarantees and learned agent behavior, raising concerns about the reliability of debate-based alignment mechanisms in practice. We conclude with implications for incentive design and future directions for empirically grounded mechanism design in multi-agent systems.
