[1] J. Zhou, J. Ji, J. Dai, and Y. Yang, “Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback”, AAAI, vol. 39, no. 26, pp. 27765–27773, Apr. 2025.