Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

Authors

  • Mingqi Wu, Fudan University
  • Zhihao Zhang, Fudan University; Shanghai Artificial Intelligence Laboratory
  • Qiaole Dong, Fudan University
  • Zhiheng Xi, Fudan University
  • Jun Zhao, Fudan University
  • Senjie Jin, Fudan University
  • Xiaoran Fan, Fudan University
  • Yuhao Zhou, Fudan University
  • Huijie Lv, Fudan University; Shanghai Artificial Intelligence Laboratory
  • Ming Zhang, Fudan University
  • Yanwei Fu, Fudan University
  • Qin Liu, University of California, Davis
  • Songyang Zhang, Shanghai Artificial Intelligence Laboratory
  • Qi Zhang, Fudan University; Shanghai Artificial Intelligence Laboratory; Shanghai Key Lab of Intelligent Information Processing

DOI:

https://doi.org/10.1609/aaai.v40i40.40687

Abstract

Reasoning in large language models has long been a central research focus, and recent studies employing reinforcement learning (RL) have introduced diverse methods that yield substantial performance gains with minimal or even no external supervision. Surprisingly, some studies even suggest that random or incorrect reward signals can enhance performance. However, these breakthroughs are predominantly reported for the mathematically strong Qwen2.5 series on benchmarks such as MATH-500, AMC, and AIME, and seldom transfer to models such as Llama, which warrants closer investigation. Our empirical analysis reveals that pre-training on massive web-scale corpora leaves the Qwen2.5 series susceptible to data contamination on these widely used benchmarks; conclusions drawn from such contaminated benchmarks for the Qwen2.5 series may therefore be unreliable. To obtain trustworthy evaluation results, we introduce RandomCalculation, a generator that creates fully clean arithmetic problems of arbitrary length and difficulty. Using this leakage-free dataset, we show that only accurate reward signals yield steady improvements that surpass the base model’s performance boundary in mathematical reasoning, whereas random or incorrect rewards do not. We further conduct fine-grained analyses to elucidate the factors behind the divergent results on the MATH-500 and RandomCalculation benchmarks. Accordingly, we recommend that future studies evaluate models on uncontaminated benchmarks and, when feasible, across multiple model series to ensure trustworthy conclusions about RL and related methods.
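
The paper itself does not publish code on this page, but the idea behind RandomCalculation admits a compact illustration. The Python sketch below is a hypothetical reconstruction, not the authors’ implementation: the names random_calculation, exact_reward, num_ops, and max_operand are illustrative. Difficulty is controlled by the number of chained operations, operands are drawn at random so every problem is freshly generated and cannot have leaked into a pre-training corpus, and exact rational arithmetic gives an unambiguous ground truth for a binary reward.

import random
from fractions import Fraction

def random_calculation(num_ops, max_operand=100, rng=None):
    """Generate a fully parenthesized, left-to-right arithmetic problem
    with `num_ops` operations, together with its exact answer."""
    rng = rng or random.Random()
    value = Fraction(rng.randint(1, max_operand))
    expr = str(value)
    for _ in range(num_ops):
        op = rng.choice("+-*/")
        operand = rng.randint(1, max_operand)  # >= 1, so division is always defined
        expr = f"({expr} {op} {operand})"
        if op == "+":
            value += operand
        elif op == "-":
            value -= operand
        elif op == "*":
            value *= operand
        else:
            value /= operand                   # Fraction division stays exact
    return expr, value

def exact_reward(model_answer, target):
    """Binary reward: 1.0 iff the parsed answer equals the exact target."""
    try:
        return float(Fraction(model_answer.strip()) == target)
    except (ValueError, ZeroDivisionError):
        return 0.0

expr, answer = random_calculation(num_ops=5, rng=random.Random(0))
print(f"Compute {expr}.  Ground truth: {answer}")

Because the target is an exact Fraction rather than a float, the reward carries no numerical-tolerance artifacts, which matters when testing the claim that only correct reward signals drive genuine improvement.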

Published

2026-03-14

How to Cite

Wu, M., Zhang, Z., Dong, Q., Xi, Z., Zhao, J., Jin, S., … Zhang, Q. (2026). Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 33944–33952. https://doi.org/10.1609/aaai.v40i40.40687

Issue

Vol. 40 No. 40 (2026)

Section

AAAI Technical Track on Natural Language Processing V