RMO: Towards Better LLM Alignment via Reshaping Reward Margin Distributions

Authors

  • Yanchi Ru, Xi'an Jiaotong University
  • Yue Huang, University of Notre Dame
  • Xiangliang Zhang, University of Notre Dame

DOI:

https://doi.org/10.1609/aaai.v40i39.40565

Abstract

Large Language Models (LLMs) have achieved remarkable success in instruction-following and dialogue tasks, yet aligning them with human preferences remains a critical challenge. Recent advances such as Direct Preference Optimization (DPO) simplify the alignment pipeline by bypassing explicit reward modeling, but they often suffer from suboptimal reward margin distributions, leading to weak supervision signals and reduced discriminative capacity. In this work, we propose Reward Margin Optimization (RMO), a framework that reshapes reward margin distributions during training to improve alignment performance. RMO comprises three components: (1) a Dual Denoising Filtering strategy that filters ambiguous and noisy preference pairs based on reward margin dynamics; (2) Batch Margin Diversification, which maximizes intra-batch margin variance to enhance learning signal diversity; and (3) Pairwise Margin Amplification, an auxiliary regularization term that encourages larger margins between preferred and dispreferred responses. Extensive experiments on multiple LLMs and datasets demonstrate that RMO consistently improves win rates over strong baselines such as DPO and SimPO, while remaining compatible with various preference-based optimization methods. Our results highlight the critical role of reward margin distribution in preference alignment and establish RMO as an effective and scalable enhancement to existing alignment techniques.
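The abstract describes RMO's three components only at a high level, and this page does not include the authors' implementation. The following is a minimal sketch of how such a loss could be assembled on top of DPO's implicit reward margin. Everything here is an illustrative assumption rather than the paper's method: the function name rmo_style_loss, the filtering thresholds eps and tau, the weights lambda_var and lambda_amp, and the hinge target gamma are hypothetical.

    # Hypothetical sketch of an RMO-style objective built on DPO's implicit
    # reward margin. Names, thresholds, and weights are illustrative
    # assumptions, not taken from the paper.
    import torch
    import torch.nn.functional as F

    def rmo_style_loss(policy_logp_w, policy_logp_l,
                       ref_logp_w, ref_logp_l,
                       beta=0.1, eps=0.05, tau=10.0,
                       lambda_var=0.01, lambda_amp=0.1, gamma=1.0):
        """Inputs: (batch,) summed log-probs of the chosen (w) and rejected
        (l) responses under the policy and the frozen reference model."""
        # DPO's implicit reward margin between preferred and dispreferred
        # responses.
        margin = beta * ((policy_logp_w - ref_logp_w) -
                         (policy_logp_l - ref_logp_l))

        # (1) Dual denoising filtering (assumed form): drop ambiguous pairs
        #     (|margin| near zero) and suspected label-noise pairs
        #     (|margin| implausibly large).
        keep = (margin.abs() > eps) & (margin.abs() < tau)
        kept = margin[keep] if keep.any() else margin  # fall back if all filtered

        # Standard DPO objective on the retained pairs.
        dpo_loss = -F.logsigmoid(kept).mean()

        # (2) Batch margin diversification (assumed form): reward intra-batch
        #     margin variance; subtracting it from the loss encourages
        #     diverse margins within the batch.
        var_bonus = kept.var(unbiased=False) if kept.numel() > 1 \
            else kept.new_zeros(())

        # (3) Pairwise margin amplification (assumed form): hinge term that
        #     pushes margins above a target gamma.
        amp = F.relu(gamma - kept).mean()

        return dpo_loss - lambda_var * var_bonus + lambda_amp * amp

In this sketch a caller would pass per-example summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model, and backpropagate the returned scalar exactly as with a standard DPO loss; the specific filtering rule, diversification term, and amplification term used in RMO may differ from these assumed forms.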

Published

2026-03-14

How to Cite

Ru, Y., Huang, Y., & Zhang, X. (2026). RMO: Towards Better LLM Alignment via Reshaping Reward Margin Distributions. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 32851-32859. https://doi.org/10.1609/aaai.v40i39.40565

Section

AAAI Technical Track on Natural Language Processing IV