[1] W. B. Knox, “Learning Optimal Advantage from Preferences and Mistaking It for Reward”, AAAI, vol. 38, no. 9, pp. 10066-10073, Mar. 2024.