Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model

Authors

  • Junshu Pan (Zhejiang University; School of Engineering, Westlake University; Shanghai Innovation Institute)
  • Wei Shen (Independent Researcher)
  • Shulin Huang (Zhejiang University; School of Engineering, Westlake University)
  • Qiji Zhou (School of Engineering, Westlake University)
  • Yue Zhang (School of Engineering, Westlake University)

DOI:

https://doi.org/10.1609/aaai.v40i38.40542

Abstract

Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for large language models (LLMs) by training directly on offline preference data to align with human preferences. During DPO training, the reference model acts as an implicit data-weight adjuster. However, the common practice of initializing the policy and reference models identically can lead to inefficient data utilization and impose a performance ceiling. Meanwhile, Simple Preference Optimization (SimPO) omits the reference model entirely, which reduces training robustness and requires stricter conditions to prevent catastrophic forgetting. In this work, we propose Pre-DPO, a simple yet effective DPO-based training paradigm that improves preference optimization by introducing a guiding reference model. This reference model provides foresight into the desired policy state achievable from the training preference data, serving as a guiding mechanism that adaptively assigns higher weights to samples better suited to the model and lower weights to those less suited. Extensive experiments on the AlpacaEval 2 and Arena-Hard v0.1 benchmarks demonstrate that Pre-DPO consistently improves the performance of both DPO and SimPO, without relying on external models or additional data.
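The weighting role the abstract attributes to the reference model can be seen directly in the standard DPO objective. The sketch below is a minimal illustration of that loss, not code from the paper; all names are illustrative. Under the Pre-DPO idea described above, the reference log-probabilities would come from a policy already optimized once on the same preference data rather than from a frozen copy of the initial policy.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-sample DPO loss for one (chosen, rejected) preference pair.

    The reference model enters only through its log-probability margin,
    so it effectively reweights each sample: pairs the reference model
    already ranks strongly in the preferred direction contribute a
    smaller loss (and gradient) than pairs it gets wrong.
    """
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    # -log(sigmoid(beta * (policy_margin - ref_margin)))
    return -math.log(1.0 / (1.0 + math.exp(-beta * (policy_margin - ref_margin))))

# When the policy margin equals the reference margin, the loss is log(2):
print(round(dpo_loss(-1.0, -2.0, -1.0, -2.0), 4))  # 0.6931
```

In standard DPO both models start identical, so every pair begins at this log(2) point; replacing the reference with a better-trained guiding model shifts the per-sample margins, and hence the effective sample weights, before training starts.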

Published

2026-03-14

How to Cite

Pan, J., Shen, W., Huang, S., Zhou, Q., & Zhang, Y. (2026). Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model. Proceedings of the AAAI Conference on Artificial Intelligence, 40(38), 32646–32654. https://doi.org/10.1609/aaai.v40i38.40542

Section

AAAI Technical Track on Natural Language Processing III