Bootstrapping LLMs via Preference-Based Policy Optimization

Authors

  • Chen Jia SI-TECH Information Technology

DOI:

https://doi.org/10.1609/aaai.v40i37.40388

Abstract

Bootstrapping large language models (LLMs) via preference-based policy optimization offers a way to align model behavior with human preferences while reducing reliance on extensive manual annotation. We propose a novel preference-based policy optimization (PbPO) framework that formulates learning as a min-max game between the LLM policy and a reward model (RM). Constraining the RM to a confidence set derived from the collected preferences ensures reliable exploitation while also promoting robust exploration. Our iterative online algorithm actively collects new preference data from the evolving policy, enabling continual self-improvement of both the policy and the RM. We provide theoretical guarantees, establishing high-probability regret bounds for both sequence-level and token-level RMs. Extensive experiments across five benchmark datasets demonstrate that PbPO consistently outperforms state-of-the-art preference optimization methods.
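
As a rough illustration of the kind of confidence-set-constrained min-max game the abstract describes, one common formalization from the preference-based reinforcement learning literature is sketched below. Everything in it is an assumption made for illustration and is not taken from the paper: the Bradley-Terry preference model with sigmoid $\sigma$, the preference dataset $\mathcal{D}$ of triples (prompt $x$, preferred response $y^+$, rejected response $y^-$), the confidence radius $\beta$, the reference policy $\pi_{\mathrm{ref}}$, and the ordering of the max and min. The paper's actual objective and notation may differ.

```latex
\begin{align}
% Assumed Bradley-Terry preference model for a reward function r
P_r(y^+ \succ y^- \mid x) &= \sigma\big(r(x, y^+) - r(x, y^-)\big), \\
% Log-likelihood of the collected preferences under r
L_{\mathcal{D}}(r) &= \sum_{(x,\, y^+,\, y^-) \in \mathcal{D}} \log P_r(y^+ \succ y^- \mid x), \\
% Confidence set: reward models nearly as likely as the maximum-likelihood one
\mathcal{R}(\mathcal{D}) &= \Big\{ r : L_{\mathcal{D}}(r) \ge \max_{r'} L_{\mathcal{D}}(r') - \beta \Big\}, \\
% Min-max game: the policy maximizes its worst-case advantage over a reference policy
\hat{\pi} &\in \arg\max_{\pi} \min_{r \in \mathcal{R}(\mathcal{D})}
  \Big( \mathbb{E}_{x,\, y \sim \pi(\cdot \mid x)}\big[r(x, y)\big]
      - \mathbb{E}_{x,\, y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\big[r(x, y)\big] \Big).
\end{align}
```

In an iterative online variant of this sketch, each round would sample new response pairs from the current policy, collect preferences on them, add them to $\mathcal{D}$, and re-solve the game, so the confidence set shrinks as data accumulates.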

Published

2026-03-14

How to Cite

Jia, C. (2026). Bootstrapping LLMs via Preference-Based Policy Optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37), 31256–31264. https://doi.org/10.1609/aaai.v40i37.40388

Section

AAAI Technical Track on Natural Language Processing II