Zhang, S. (2026) “Interpretable Reward Model via Sparse Autoencoder”, Proceedings of the AAAI Conference on Artificial Intelligence, 40(41), pp. 34808–34816. doi: 10.1609/aaai.v40i41.40783.