Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model
DOI:
https://doi.org/10.1609/aaai.v40i38.40542
Abstract
Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for large language models (LLMs) by directly training on offline preference data to align with human preferences. During DPO training, the reference model serves as a data weight adjuster. However, the common practice of initializing the policy and reference models identically in DPO can lead to inefficient data utilization and impose a performance ceiling. Meanwhile, the absence of a reference model in Simple Preference Optimization (SimPO) reduces training robustness and requires stricter conditions to prevent catastrophic forgetting. In this work, we propose Pre-DPO, a simple yet effective DPO-based training paradigm that improves preference optimization by introducing a guiding reference model. This reference model provides foresight into the desired policy state achievable through the training preference data, serving as a guiding mechanism that adaptively assigns higher weights to samples more suitable for the model and lower weights to those less suitable. Extensive experiments on the AlpacaEval 2 and Arena-Hard v0.1 benchmarks demonstrate that Pre-DPO consistently improves the performance of both DPO and SimPO, without relying on external models or additional data.
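For context on the role of the reference model described above, the standard DPO objective (from the original DPO formulation by Rafailov et al., 2023; not restated on this page) is sketched below. It makes explicit how the reference policy enters the loss and, through the gradient, acts as a per-sample weight on the preference pairs; the specific way Pre-DPO obtains its guiding reference model is detailed in the full paper, not here.

```latex
% Standard DPO objective over preference pairs (y_w preferred over y_l given prompt x).
% \pi_\theta is the trainable policy, \pi_{\mathrm{ref}} the (frozen) reference policy,
% \sigma the logistic function, and \beta a temperature hyperparameter.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        \;-\;
        \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
% The gradient of this loss scales each pair by a factor that depends on the implicit
% rewards computed against \pi_{\mathrm{ref}}, which is the sense in which the reference
% model "serves as a data weight adjuster" in the abstract above.
```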
Published
2026-03-14
How to Cite
Pan, J., Shen, W., Huang, S., Zhou, Q., & Zhang, Y. (2026). Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model. Proceedings of the AAAI Conference on Artificial Intelligence, 40(38), 32646–32654. https://doi.org/10.1609/aaai.v40i38.40542
Issue
Vol. 40 No. 38 (2026)
Section
AAAI Technical Track on Natural Language Processing III