Simulated Rewards, Skewed Strategies: Tracing the Acquired Preference Bias in LLM-Based Dialogue Planners
DOI:
https://doi.org/10.1609/aaai.v40i26.39348
Abstract
Large language models have enabled sophisticated dialogue planning policies, but their reliance on LLM-generated simulation and feedback for policy optimization may introduce systematic preference bias. We present the first comprehensive analysis of preference bias in LLM-based dialogue planners, evaluating four state-of-the-art planning policies across three dialogue domains using multiple LLM families at varying scales. Our investigation reveals that all tested planners exhibit significant preference bias, systematically favoring narrow strategy sets rather than maintaining balanced distributions. User simulation emerges as the primary bias driver, while diverse persona simulation fails as an effective mitigation strategy. Most concerningly, preference bias drives planners toward ethically problematic strategies that achieve short-term success while undermining real-world effectiveness and ethical standards. Our findings establish fundamental challenges for the responsible deployment of LLM-based dialogue systems and provide crucial insights for developing more reliable and ethically aligned planning approaches.
Published
2026-03-14
How to Cite
Huang, H., Yang, Y., Sun, H., Li, J., & Gao, Y. (2026). Simulated Rewards, Skewed Strategies: Tracing the Acquired Preference Bias in LLM-Based Dialogue Planners. Proceedings of the AAAI Conference on Artificial Intelligence, 40(26), 21948–21956. https://doi.org/10.1609/aaai.v40i26.39348
Section
AAAI Technical Track on Machine Learning III