Preference Optimization via Contrastive Divergence: Your Policy Is Secretly an NLL Estimator
DOI:
https://doi.org/10.1609/aaai.v40i44.41060

Abstract
Existing studies on preference optimization (PO) have focused on constructing pairwise preference data following simple heuristics, such as maximizing the margin between chosen and rejected responses based on human (or AI) ratings. In this work, we develop a novel PO framework that provides theoretical guidance for effectively sampling rejected responses. To achieve this, we formulate PO as minimizing the negative log-likelihood (NLL) of a probability model and propose a sampling-based solution to estimate its normalization constant via contrastive divergence. We show that these estimative samples can act as rejected responses in PO. Leveraging the connection established between PO and NLL estimation, we propose a novel PO algorithm, called Monte-Carlo-based PO (MC-PO), that applies an MC kernel to sample *hard negatives* w.r.t. the log-likelihood of the target policy. Intuitively, these hard negatives represent the rejected samples that are most difficult for the current policy to differentiate. We show that MC-PO outperforms existing SOTA baselines on popular alignment benchmarks.
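The two sampling ideas in the abstract can be illustrated with a minimal sketch. The function names (`select_hard_negative`, `mh_step`), the Metropolis-Hastings choice of MC kernel, and the toy log-likelihoods are illustrative assumptions, not the paper's actual implementation: a hard negative is simply the candidate response the current policy assigns the highest log-likelihood, and an MC kernel can draw such candidates by biasing transitions toward high-likelihood responses.

```python
import math
import random


def select_hard_negative(candidates, logp_fn):
    """Pick the candidate with the highest log-likelihood under the
    current policy -- the 'hardest' negative, i.e. the rejected
    response the policy is most prone to generate itself."""
    return max(candidates, key=logp_fn)


def mh_step(current, proposal_fn, logp_fn, rng=random):
    """One Metropolis-Hastings step targeting the policy's likelihood:
    propose a new response and accept it with probability
    min(1, p(proposal) / p(current)). Iterating this kernel drifts the
    sample toward high-likelihood (hard-negative) regions."""
    proposal = proposal_fn(current)
    accept_prob = math.exp(min(0.0, logp_fn(proposal) - logp_fn(current)))
    return proposal if rng.random() < accept_prob else current
```

For example, with toy log-likelihoods `{"a": -3.0, "b": -1.0, "c": -2.0}`, `select_hard_negative` returns `"b"`, and an `mh_step` proposing `"b"` from `"a"` always accepts, since the likelihood ratio exceeds one.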
Published
2026-03-14
How to Cite
Chen, Z., Liu, F., Zhu, X., Wang, H., Li, J., Qi, Y., & Ghavamzadeh, M. (2026). Preference Optimization via Contrastive Divergence: Your Policy Is Secretly an NLL Estimator. Proceedings of the AAAI Conference on Artificial Intelligence, 40(44), 37286–37294. https://doi.org/10.1609/aaai.v40i44.41060
Section
AAAI Special Track on AI Alignment