Reward-on-the-Line: A Novel Offline Reinforcement Learning Method for Building Legal Conversational Agents
DOI:
https://doi.org/10.1609/aies.v8i2.36657
Abstract
Offline reinforcement learning (RL) offers a promising path for training domain-specific conversational agents (CAs) using large-scale historical dialogue data, without the need for costly online interactions or human annotations. In the legal domain, vast amounts of publicly available courtroom transcripts provide a rich and underutilized resource for developing intelligent legal CAs. However, offline training suffers from distribution shift between the learned policy and the behavior policy embedded in the training data, which can degrade agent performance at deployment. We address this challenge with a novel offline RL method, Reward-on-the-Line (ROL), which calibrates rewards based on action-selection agreement among an ensemble of CAs. We apply ROL to the U.S. Supreme Court dataset to demonstrate its effectiveness in learning proactive, legally-informed dialogue strategies from historical court proceedings. To show the broader applicability of our approach, we also evaluate ROL on the CraigslistBargain negotiation dataset. Results in both domains confirm that ROL reduces distribution shift and improves agent performance in unseen dialogue scenarios.
Published
2025-10-15
How to Cite
Lin, X., Wang, M., Yang, G. H., & Chen, D. (2025). Reward-on-the-Line: A Novel Offline Reinforcement Learning Method for Building Legal Conversational Agents. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 8(2), 1575–1584. https://doi.org/10.1609/aies.v8i2.36657