Causal Reward Adjustment: Mitigating Reward Hacking in External Reasoning via Backdoor Correction

Authors

  • Ruike Song, Institute of Software, Chinese Academy of Sciences; Nankai University
  • Zeen Song, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences
  • Huijie Guo, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences
  • Wenwen Qiang, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences

DOI:

https://doi.org/10.1609/aaai.v40i39.40584

Abstract

External reasoning systems combine language models with process reward models (PRMs) to select high-quality reasoning paths for complex tasks such as mathematical problem solving. However, these systems are prone to reward hacking, where the PRM assigns high scores to logically incorrect reasoning paths, leading to incorrect answers. From a causal inference perspective, we attribute this phenomenon primarily to the presence of confounding semantic features. To address it, we propose Causal Reward Adjustment (CRA), a method that mitigates reward hacking by estimating the true reward of a reasoning path. CRA trains sparse autoencoders on the PRM’s internal activations to recover interpretable features, then corrects for confounding via backdoor adjustment. Experiments on mathematical problem-solving datasets demonstrate that CRA mitigates reward hacking and improves final accuracy, without modifying the policy model or retraining the PRM.
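For readers unfamiliar with backdoor adjustment, the generic formula is E[y | do(x)] = Σ_z E[y | x, z] · P(z), where z is an observed confounder. The following is a minimal toy sketch of that formula only, not the authors' implementation. The abstract does not specify how CRA computes the adjustment, so everything here is an assumption made for illustration: the function names, the record layout `(x, z, y)`, and the use of a single binary flag `z` as a stand-in for a confounding semantic feature.

```python
from collections import defaultdict

def naive_mean(records, x):
    """Observational estimate E[y | x]; it absorbs the confounder's effect."""
    ys = [y for (xi, _, y) in records if xi == x]
    return sum(ys) / len(ys)

def backdoor_adjusted_mean(records, x):
    """Backdoor estimate E[y | do(x)] = sum_z E[y | x, z] * P(z)."""
    rewards_by_z = defaultdict(list)  # z -> rewards observed together with this x
    z_counts = defaultdict(int)       # z -> number of records with this z (any x)
    for xi, z, y in records:
        z_counts[z] += 1
        if xi == x:
            rewards_by_z[z].append(y)

    total = len(records)
    estimate = 0.0
    for z, count in z_counts.items():
        ys = rewards_by_z[z]
        if not ys:
            # Positivity is violated: no record pairs this x with this z.
            raise ValueError(f"no records with x={x!r} and z={z!r}")
        estimate += (sum(ys) / len(ys)) * (count / total)
    return estimate

# Hypothetical toy data: (x, z, y) = (path property, confounder flag, reward score).
# The confounder z raises the reward by itself and co-occurs mostly with x = 1.
records = (
    [(0, 0, 0.2)] * 3 + [(1, 0, 0.4)] * 1 +
    [(0, 1, 0.6)] * 1 + [(1, 1, 0.8)] * 3
)

naive_gap = naive_mean(records, 1) - naive_mean(records, 0)
adjusted_gap = backdoor_adjusted_mean(records, 1) - backdoor_adjusted_mean(records, 0)
print(f"naive gap: {naive_gap:.2f}, adjusted gap: {adjusted_gap:.2f}")
```

In this toy data the naive comparison overstates the effect of `x` on the score because `z` is correlated with both; weighting each stratum by P(z) rather than P(z | x) removes that inflation.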

Published

2026-03-14

How to Cite

Song, R., Song, Z., Guo, H., & Qiang, W. (2026). Causal Reward Adjustment: Mitigating Reward Hacking in External Reasoning via Backdoor Correction. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 33019–33027. https://doi.org/10.1609/aaai.v40i39.40584

Issue

Section

AAAI Technical Track on Natural Language Processing IV