Adaptive KL Control for Direct Preference Optimization in Instruction-Following LLMs

Yi Khuen Chai

doi:10.1609/aaai.v40i48.42313

Authors

Yi Khuen Chai, Singapore Management University

Abstract

The scaling parameter β in Direct Preference Optimization governs a fundamental trade-off: low β produces weak gradients that fail to learn from ambiguous preferences, while high β amplifies updates and causes excessive drift from the reference policy. Prior work treats β as fixed or scheduled throughout training. We introduce DualLoop-DPO, which modulates β via dual feedback: a fast loop raises β temporarily on high-uncertainty batches to enforce stronger preference margins, while a slow loop uses EMA-smoothed KL tracking to regulate policy drift. Experiments on preference alignment benchmarks show consistent improvements over existing static-β, β-scheduling, and dynamic-β baselines. These findings suggest that dual-loop β control—responding to uncertainty for learning and divergence for stability—offers a promising direction for preference-based fine-tuning.
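The dual-loop idea in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the class name `DualLoopBeta`, the use of mean binary entropy of sigmoid(margin) as the uncertainty signal, the multiplicative update rules, and all default constants. The direction of the slow update (lowering the baseline β when the smoothed KL exceeds a target) follows the abstract's framing that high β increases drift; the paper's actual rule may differ.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def mean_binary_entropy(margins: Sequence[float]) -> float:
    """Mean entropy in bits (range [0, 1]) of p = sigmoid(margin)."""
    total = 0.0
    for m in margins:
        p = sigmoid(m)
        for q in (p, 1.0 - p):
            if q > 0.0:  # treat 0 * log(0) as 0
                total -= q * math.log2(q)
    return total / len(margins)


@dataclass
class DualLoopBeta:
    """Hypothetical two-timescale controller for the DPO beta (assumed design)."""
    beta_slow: float = 0.1              # slowly adapted baseline beta
    beta_min: float = 0.01              # clamp range for beta
    beta_max: float = 1.0
    fast_gain: float = 1.0              # strength of the per-batch boost
    uncertainty_threshold: float = 0.8  # entropy above this counts as "uncertain"
    kl_target: float = 0.05             # desired smoothed KL to the reference
    ema_decay: float = 0.9              # smoothing factor for the KL estimate
    slow_rate: float = 0.1              # step size of the slow loop
    kl_ema: Optional[float] = None      # initialized from the first batch

    def step(self, margins: Sequence[float], batch_kl: float) -> float:
        """Return the beta to use for this batch, then update the baseline."""
        # Fast loop: boost beta for this batch only when uncertainty is high.
        uncertainty = mean_binary_entropy(margins)
        excess = max(0.0, uncertainty - self.uncertainty_threshold)
        beta_batch = min(self.beta_max, self.beta_slow * (1.0 + self.fast_gain * excess))

        # Slow loop: track an EMA of the KL and nudge the baseline toward the target.
        if self.kl_ema is None:
            self.kl_ema = batch_kl
        else:
            self.kl_ema = self.ema_decay * self.kl_ema + (1.0 - self.ema_decay) * batch_kl
        error = (self.kl_ema - self.kl_target) / self.kl_target
        updated = self.beta_slow * math.exp(-self.slow_rate * error)
        self.beta_slow = min(self.beta_max, max(self.beta_min, updated))
        return beta_batch
```

In a training loop, `margins` would be the per-example implicit reward margins of a batch and `batch_kl` an estimate of the divergence between the current and reference policies; the returned value would then be passed as β to the DPO loss for that batch. Because the boost is recomputed for every batch, it does not persist, while the baseline changes only gradually.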

Published

2026-03-14

How to Cite

Chai, Y. K. (2026). Adaptive KL Control for Direct Preference Optimization in Instruction-Following LLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(48), 41480–41482. https://doi.org/10.1609/aaai.v40i48.42313

Issue

Vol. 40 No. 48: EAAI-26 AI for Education, Model AI Assignments, AAAI-26 Emerging Trends, Doctoral Consortium, Student Abstracts, Undergraduate Consortium and Demonstrations

Section

AAAI Undergraduate Consortium
