Adaptive KL Control for Direct Preference Optimization in Instruction-Following LLMs
DOI: https://doi.org/10.1609/aaai.v40i48.42313
Abstract
The scaling parameter β in Direct Preference Optimization (DPO) governs a fundamental trade-off: a low β produces weak gradients that fail to learn from ambiguous preferences, while a high β amplifies updates and causes excessive drift from the reference policy. Prior work treats β as fixed or scheduled throughout training. We introduce DualLoop-DPO, which modulates β via dual feedback: a fast loop temporarily raises β on high-uncertainty batches to enforce stronger preference margins, while a slow loop tracks an exponential moving average (EMA) of the KL divergence from the reference policy to regulate policy drift. Experiments on preference alignment benchmarks show consistent improvements over existing static-β, β-scheduling, and dynamic-β baselines. These findings suggest that dual-loop β control, which responds to uncertainty for learning and to divergence for stability, offers a promising direction for preference-based fine-tuning.
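The dual-loop idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the update rules, so everything below is an assumption made for illustration: the class name `DualLoopBeta`, the use of mean binary entropy of the preference probability as the uncertainty signal, the multiplicative slow-loop update, and all default hyperparameters are hypothetical and are not taken from the paper. Following the abstract's framing that a high β increases drift, the slow loop here lowers the base β when the smoothed KL estimate exceeds a target.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


def neg_log_sigmoid(x: float) -> float:
    # Numerically stable -log(sigmoid(x)).
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def dpo_loss(margins: Sequence[float], beta: float) -> float:
    # Standard DPO objective on precomputed margins, where each margin is
    # the policy-vs-reference log-ratio of the chosen response minus that
    # of the rejected response.
    return sum(neg_log_sigmoid(beta * m) for m in margins) / len(margins)


def batch_uncertainty(margins: Sequence[float]) -> float:
    # ASSUMPTION: uncertainty is the mean binary entropy (in bits, 0..1)
    # of sigmoid(margin); the paper may define it differently.
    total = 0.0
    for m in margins:
        p = math.exp(-neg_log_sigmoid(m))
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        total -= p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p)
    return total / len(margins)


@dataclass
class DualLoopBeta:
    # All defaults are hypothetical values chosen for illustration.
    base_beta: float = 0.1      # slow-loop state
    boost: float = 0.5          # maximum relative fast-loop increase
    unc_threshold: float = 0.8  # entropy above this is "high uncertainty"
    kl_target: float = 0.05     # desired smoothed KL to the reference
    ema_decay: float = 0.9
    slow_rate: float = 0.1
    beta_min: float = 0.01
    beta_max: float = 1.0
    kl_ema: Optional[float] = None

    def step(self, margins: Sequence[float], batch_kl: float) -> float:
        # Slow loop: update the EMA of the KL estimate, then nudge the
        # base beta down if drift is above target and up if below.
        if self.kl_ema is None:
            self.kl_ema = batch_kl
        else:
            d = self.ema_decay
            self.kl_ema = d * self.kl_ema + (1.0 - d) * batch_kl
        error = (self.kl_ema - self.kl_target) / self.kl_target
        error = max(-1.0, min(1.0, error))
        self.base_beta *= math.exp(-self.slow_rate * error)
        self.base_beta = min(max(self.base_beta, self.beta_min), self.beta_max)

        # Fast loop: raise beta for this batch only when it is ambiguous.
        beta = self.base_beta
        u = batch_uncertainty(margins)
        if u > self.unc_threshold:
            excess = (u - self.unc_threshold) / (1.0 - self.unc_threshold)
            beta *= 1.0 + self.boost * excess
        return min(beta, self.beta_max)
```

In a training loop under these assumptions, one would call `beta = controller.step(margins, batch_kl)` once per batch and pass the result to `dpo_loss(margins, beta)`. The fast-loop boost is not stored, so it affects only the current batch, while the slow-loop change to `base_beta` persists across batches.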
Published
2026-03-14
How to Cite
Chai, Y. K. (2026). Adaptive KL Control for Direct Preference Optimization in Instruction-Following LLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(48), 41480–41482. https://doi.org/10.1609/aaai.v40i48.42313
Section
AAAI Undergraduate Consortium