When Debate Fails: An Empirical Study of Incentive Misalignment in Prover–Estimator Games
DOI:
https://doi.org/10.1609/aaaiss.v9i1.42904
Abstract
Debate has been proposed as a mechanism for eliciting truthful reasoning from powerful learning agents by structuring interaction as a game between competing provers evaluated by an estimator. While theoretically appealing, the practical effectiveness of such incentive-based mechanisms remains underexplored. We present the first end-to-end empirical instantiation of this protocol using reinforcement learning agents with large language models on verifiable long-context reasoning tasks. Despite careful alignment of rewards with theoretical assumptions, we find that the debate mechanism consistently fails to elicit truthful or robust reasoning. Instead, agents converge to degenerate strategies that exploit estimator weaknesses or collapse into non-informative equilibria. Through controlled experiments and representation analysis using PCA and t-SNE, we identify a fundamental incentive–optimization mismatch: equilibrium incentives do not reliably translate into learnable equilibria under gradient-based optimization. Our results highlight a gap between game-theoretic guarantees and learned agent behavior, raising concerns about the reliability of debate-based alignment mechanisms in practice. We conclude with implications for incentive design and future directions for empirically grounded mechanism design in multi-agent systems.
Published
2026-06-23
How to Cite
Guan, H., & Hou, C. (2026). When Debate Fails: An Empirical Study of Incentive Misalignment in Prover–Estimator Games. Proceedings of the AAAI Symposium Series, 9(1), 43–51. https://doi.org/10.1609/aaaiss.v9i1.42904
Section
AI-Driven Resilience: Building Robust, Adaptive Technologies for a Dynamic World (Full Papers)