Adaptive KL Control for Direct Preference Optimization in Instruction-Following LLMs

Authors

  • Yi Khuen Chai, Singapore Management University

DOI:

https://doi.org/10.1609/aaai.v40i48.42313

Abstract

The scaling parameter β in Direct Preference Optimization governs a fundamental trade-off: low β produces weak gradients that fail to learn from ambiguous preferences, while high β amplifies updates and causes excessive drift from the reference policy. Prior work treats β as fixed or scheduled throughout training. We introduce DualLoop-DPO, which modulates β via dual feedback: a fast loop raises β temporarily on high-uncertainty batches to enforce stronger preference margins, while a slow loop uses EMA-smoothed KL tracking to regulate policy drift. Experiments on preference alignment benchmarks show consistent improvements over existing static-β, β-scheduling, and dynamic-β baselines. These findings suggest that dual-loop β control, which responds to uncertainty for learning and to divergence for stability, offers a promising direction for preference-based fine-tuning.
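
The dual-loop idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the class name DualLoopBeta, the entropy-based uncertainty measure, the multiplicative update rules, and every constant below are hypothetical choices made for illustration. In particular, the direction of the slow loop (lowering the base β when the smoothed KL exceeds a target) is an assumption that follows the abstract's framing that high β increases drift.

```python
import math
from dataclasses import dataclass


@dataclass
class DualLoopBeta:
    """Hypothetical dual-loop beta controller; not the paper's implementation."""

    base_beta: float = 0.1    # slowly adapted beta (assumed starting value)
    kl_target: float = 0.05   # assumed target for the smoothed KL estimate
    ema_decay: float = 0.9    # EMA smoothing factor for the slow loop
    slow_rate: float = 0.1    # slow-loop step size in log space
    fast_gain: float = 1.0    # maximum relative boost from the fast loop
    beta_min: float = 0.01    # lower bound on any beta returned
    beta_max: float = 1.0     # upper bound on any beta returned
    kl_ema: float = 0.0       # running EMA of the per-batch KL estimate

    def _clip(self, beta):
        return min(self.beta_max, max(self.beta_min, beta))

    def uncertainty(self, margins):
        """Mean binary entropy of sigmoid(margin), scaled to [0, 1].

        Assumes a non-empty list of preference margins. The entropy is
        computed in a form that stays finite for large |margin|.
        """
        total = 0.0
        for m in margins:
            a = abs(m)
            e = math.exp(-a)
            total += math.log1p(e) + a * e / (1.0 + e)
        return total / (len(margins) * math.log(2.0))

    def batch_beta(self, margins):
        """Fast loop: temporarily raise beta on high-uncertainty batches."""
        boost = 1.0 + self.fast_gain * self.uncertainty(margins)
        return self._clip(self.base_beta * boost)

    def update_slow(self, batch_kl):
        """Slow loop: smooth the KL estimate and nudge the base beta.

        Assumed direction: base beta shrinks when smoothed KL is above target.
        """
        self.kl_ema = (self.ema_decay * self.kl_ema
                       + (1.0 - self.ema_decay) * batch_kl)
        rel_error = (self.kl_ema - self.kl_target) / self.kl_target
        self.base_beta = self._clip(
            self.base_beta * math.exp(-self.slow_rate * rel_error))
        return self.base_beta
```

In a training loop, one might call batch_beta on the current batch's preference margins to obtain the β used in that step's DPO loss, then call update_slow with an estimate of the policy-to-reference KL after the step. Keeping the fast boost separate from base_beta means the uncertainty response does not persist across batches, while the KL response accumulates slowly.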

Published

2026-03-14

How to Cite

Chai, Y. K. (2026). Adaptive KL Control for Direct Preference Optimization in Instruction-Following LLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(48), 41480–41482. https://doi.org/10.1609/aaai.v40i48.42313