Preference Optimization via Contrastive Divergence: Your Policy Is Secretly an NLL Estimator
DOI:
https://doi.org/10.1609/aaai.v40i44.41060

Abstract
Existing studies on preference optimization (PO) have focused on constructing pairwise preference data following simple heuristics, such as maximizing the margin between chosen and rejected responses based on human (or AI) ratings. In this work, we develop a novel PO framework that provides theoretical guidance for effectively sampling rejected responses. To achieve this, we formulate PO as minimizing the negative log-likelihood (NLL) of a probability model and propose a sampling-based solution to estimate its normalization constant via contrastive divergence. We show that these estimative samples can act as rejected responses in PO. Leveraging the connection established between PO and NLL estimation, we propose a novel PO algorithm, called Monte-Carlo-based PO (MC-PO), that applies an MC kernel to sample *hard negatives* w.r.t. the log-likelihood of the target policy. Intuitively, these hard negatives represent the rejected samples that are most difficult for the current policy to differentiate. We show that MC-PO outperforms existing SOTA baselines on popular alignment benchmarks.
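The two sampling ideas in the abstract can be illustrated with a minimal sketch. The function names (`select_hard_negative`, `mh_step`), the Metropolis-Hastings choice of MC kernel, and the toy log-likelihoods are illustrative assumptions, not the paper's actual implementation: a hard negative is simply the candidate response the current policy assigns the highest log-likelihood, and an MC kernel can draw such candidates by biasing transitions toward high-likelihood responses.

```python
import math
import random


def select_hard_negative(candidates, logp_fn):
    """Pick the candidate with the highest log-likelihood under the
    current policy -- the 'hardest' negative, i.e. the rejected
    response the policy is most prone to generate itself."""
    return max(candidates, key=logp_fn)


def mh_step(current, proposal_fn, logp_fn, rng=random):
    """One Metropolis-Hastings step targeting the policy's likelihood:
    propose a new response and accept it with probability
    min(1, p(proposal) / p(current)). Iterating this kernel drifts the
    sample toward high-likelihood (hard-negative) regions."""
    proposal = proposal_fn(current)
    accept_prob = math.exp(min(0.0, logp_fn(proposal) - logp_fn(current)))
    return proposal if rng.random() < accept_prob else current
```

For example, with toy log-likelihoods `{"a": -3.0, "b": -1.0, "c": -2.0}`, `select_hard_negative` returns `"b"`, and an `mh_step` proposing `"b"` from `"a"` always accepts, since the likelihood ratio exceeds one.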
Published
2026-03-14
How to Cite
Chen, Z., Liu, F., Zhu, X., Wang, H., Li, J., Qi, Y., & Ghavamzadeh, M. (2026). Preference Optimization via Contrastive Divergence: Your Policy Is Secretly an NLL Estimator. Proceedings of the AAAI Conference on Artificial Intelligence, 40(44), 37286–37294. https://doi.org/10.1609/aaai.v40i44.41060
Section
AAAI Special Track on AI Alignment