SharedRep-RLHF: A Shared Representation Approach to RLHF with Diverse Preferences

Authors

  • Arpan Mukherjee, Imperial College London
  • Marcello Bullo, Imperial College London
  • Deniz Gündüz, Imperial College London

DOI:

https://doi.org/10.1609/aaai.v40i44.41110

Abstract

Uniform-reward reinforcement learning from human feedback (RLHF), which trains a single reward model to represent the preferences of all annotators, fails to capture the diversity of opinions across sub-populations, inadvertently favoring dominant groups. The state-of-the-art method, MaxMin-RLHF, addresses this by learning group-specific reward models and optimizing for the group receiving the minimum reward, thereby promoting fairness. However, we identify a key limitation of MaxMin-RLHF: its poor performance when the minimum-reward group is a minority. To mitigate this drawback, we introduce a novel framework, termed *SharedRep-RLHF*. At its core, SharedRep-RLHF learns and leverages *shared preference traits* in annotations among various groups, in contrast to learning separate reward models across groups. We first show that MaxMin-RLHF is provably suboptimal in learning shared traits, and then quantify the sample complexity of SharedRep-RLHF. Experiments across diverse natural language tasks showcase the effectiveness of SharedRep-RLHF compared to MaxMin-RLHF, with a gain of up to 20% in win rate.
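
The contrast the abstract draws between a uniform reward and a max-min objective can be illustrated with a toy sketch. This is not the authors' implementation and it does not implement SharedRep-RLHF; the group names, proportions, and reward values are hypothetical numbers chosen for illustration, and the uniform reward is approximated here by a population-weighted average of per-group rewards, which is a simplifying assumption.

```python
# Toy illustration only: all rewards and group proportions are made-up values.
# Per-group reward assigned to each candidate response (hypothetical).
group_rewards = {
    "majority": {"response_a": 0.9, "response_b": 0.6},
    "minority": {"response_a": 0.1, "response_b": 0.5},
}
# Hypothetical share of annotators belonging to each group.
group_weights = {"majority": 0.8, "minority": 0.2}


def uniform_score(response):
    """Population-weighted average reward (stand-in for a single pooled reward)."""
    return sum(group_weights[g] * group_rewards[g][response] for g in group_rewards)


def maxmin_score(response):
    """Reward of the worst-off group for this response (max-min criterion)."""
    return min(group_rewards[g][response] for g in group_rewards)


responses = ["response_a", "response_b"]
uniform_choice = max(responses, key=uniform_score)
maxmin_choice = max(responses, key=maxmin_score)
```

With these numbers, the averaged score prefers the response that the larger group rates highly, while the max-min criterion prefers the response whose lowest group reward is highest. The sketch only compares fixed candidate responses; the policy optimization step of RLHF is omitted.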

Published

2026-03-14

How to Cite

Mukherjee, A., Bullo, M., & Gündüz, D. (2026). SharedRep-RLHF: A Shared Representation Approach to RLHF with Diverse Preferences. Proceedings of the AAAI Conference on Artificial Intelligence, 40(44), 37747–37755. https://doi.org/10.1609/aaai.v40i44.41110

Issue

Section

AAAI Special Track on AI Alignment