Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model
DOI:
https://doi.org/10.1609/aaai.v40i38.40542
Abstract
Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for large language models (LLMs) by directly training on offline preference data to align with human preferences. During DPO training, the reference model serves as a data weight adjuster. However, the common practice of initializing the policy and reference models identically in DPO can lead to inefficient data utilization and impose a performance ceiling. Meanwhile, the absence of a reference model in Simple Preference Optimization (SimPO) reduces training robustness and requires stricter conditions to prevent catastrophic forgetting. In this work, we propose Pre-DPO, a simple yet effective DPO-based training paradigm that improves preference optimization by introducing a guiding reference model. This reference model provides foresight into the desired policy state achievable through the training preference data, serving as a guiding mechanism that adaptively assigns higher weights to samples more suitable for the model and lower weights to those less suitable. Extensive experiments on the AlpacaEval 2 and Arena-Hard v0.1 benchmarks demonstrate that Pre-DPO consistently improves the performance of both DPO and SimPO, without relying on external models or additional data.
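For context on the role of the reference model described above, the standard DPO objective (from the original DPO formulation by Rafailov et al., 2023; not restated on this page) is sketched below. It makes explicit how the reference policy enters the loss and, through the gradient, acts as a per-sample weight on the preference pairs; the specific way Pre-DPO obtains its guiding reference model is detailed in the full paper, not here.

```latex
% Standard DPO objective over preference pairs (y_w preferred over y_l given prompt x).
% \pi_\theta is the trainable policy, \pi_{\mathrm{ref}} the (frozen) reference policy,
% \sigma the logistic function, and \beta a temperature hyperparameter.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        \;-\;
        \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
% The gradient of this loss scales each pair by a factor that depends on the implicit
% rewards computed against \pi_{\mathrm{ref}}, which is the sense in which the reference
% model "serves as a data weight adjuster" in the abstract above.
```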
Published
2026-03-14
How to Cite
Pan, J., Shen, W., Huang, S., Zhou, Q., & Zhang, Y. (2026). Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model. Proceedings of the AAAI Conference on Artificial Intelligence, 40(38), 32646–32654. https://doi.org/10.1609/aaai.v40i38.40542
Issue
Vol. 40 No. 38 (2026)
Section
AAAI Technical Track on Natural Language Processing III