Zhang, Shuyi, Wei Shi, Sihang Li, Jiayi Liao, Tao Liang, Hengxing Cai, and Xiang Wang. “Interpretable Reward Model via Sparse Autoencoder.” Proceedings of the AAAI Conference on Artificial Intelligence 40, no. 41 (March 14, 2026): 34808–34816. Accessed May 13, 2026. https://ojs.aaai.org/index.php/AAAI/article/view/40783.