Lu, S. (2026) “URPO: A Unified Reward & Policy Optimization Framework for Large Language Models”, Proceedings of the AAAI Conference on Artificial Intelligence, 40(38), pp. 32329–32337. doi: 10.1609/aaai.v40i38.40507.