Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

Authors

  • Mingqi Wu, Fudan University
  • Zhihao Zhang, Fudan University; Shanghai Artificial Intelligence Laboratory
  • Qiaole Dong, Fudan University
  • Zhiheng Xi, Fudan University
  • Jun Zhao, Fudan University
  • Senjie Jin, Fudan University
  • Xiaoran Fan, Fudan University
  • Yuhao Zhou, Fudan University
  • Huijie Lv, Fudan University; Shanghai Artificial Intelligence Laboratory
  • Ming Zhang, Fudan University
  • Yanwei Fu, Fudan University
  • Qin Liu, University of California, Davis
  • Songyang Zhang, Shanghai Artificial Intelligence Laboratory
  • Qi Zhang, Fudan University; Shanghai Artificial Intelligence Laboratory; Shanghai Key Lab of Intelligent Information Processing

DOI:

https://doi.org/10.1609/aaai.v40i40.40687

Abstract

Reasoning in large language models has long been a central research focus, and recent studies employing reinforcement learning (RL) have introduced diverse methods that yield substantial performance gains with minimal or even no external supervision. Surprisingly, some studies even suggest that random or incorrect reward signals can enhance performance. However, these breakthroughs are predominantly reported for the mathematically strong Qwen2.5 series on benchmarks such as MATH-500, AMC, and AIME, and seldom transfer to models such as Llama, which warrants closer investigation. Our empirical analysis reveals that pre-training on massive web-scale corpora leaves the Qwen2.5 series susceptible to data contamination on these widely used benchmarks; conclusions drawn from such contaminated benchmarks for the Qwen2.5 series may therefore be unreliable. To obtain trustworthy evaluation results, we introduce RandomCalculation, a generator that creates fully clean arithmetic problems of arbitrary length and difficulty. Using this leakage-free dataset, we show that only accurate reward signals yield steady improvements that surpass the base model’s performance boundary in mathematical reasoning, whereas random or incorrect rewards do not. We further conduct fine-grained analyses to elucidate the factors behind the divergent results on the MATH-500 and RandomCalculation benchmarks. Accordingly, we recommend that future studies evaluate models on uncontaminated benchmarks and, when feasible, across multiple model series to ensure trustworthy conclusions about RL and related methods.
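
The paper itself does not publish code on this page, but the idea behind RandomCalculation admits a compact illustration. The Python sketch below is a hypothetical reconstruction, not the authors’ implementation: the names random_calculation, exact_reward, num_ops, and max_operand are illustrative. Difficulty is controlled by the number of chained operations, operands are drawn at random so every problem is freshly generated and cannot have leaked into a pre-training corpus, and exact rational arithmetic gives an unambiguous ground truth for a binary reward.

import random
from fractions import Fraction

def random_calculation(num_ops, max_operand=100, rng=None):
    """Generate a fully parenthesized, left-to-right arithmetic problem
    with `num_ops` operations, together with its exact answer."""
    rng = rng or random.Random()
    value = Fraction(rng.randint(1, max_operand))
    expr = str(value)
    for _ in range(num_ops):
        op = rng.choice("+-*/")
        operand = rng.randint(1, max_operand)  # >= 1, so division is always defined
        expr = f"({expr} {op} {operand})"
        if op == "+":
            value += operand
        elif op == "-":
            value -= operand
        elif op == "*":
            value *= operand
        else:
            value /= operand                   # Fraction division stays exact
    return expr, value

def exact_reward(model_answer, target):
    """Binary reward: 1.0 iff the parsed answer equals the exact target."""
    try:
        return float(Fraction(model_answer.strip()) == target)
    except (ValueError, ZeroDivisionError):
        return 0.0

expr, answer = random_calculation(num_ops=5, rng=random.Random(0))
print(f"Compute {expr}.  Ground truth: {answer}")

Because the target is an exact Fraction rather than a float, the reward carries no numerical-tolerance artifacts, which matters when testing the claim that only correct reward signals drive genuine improvement.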

Published

2026-03-14

How to Cite

Wu, M., Zhang, Z., Dong, Q., Xi, Z., Zhao, J., Jin, S., … Zhang, Q. (2026). Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 33944–33952. https://doi.org/10.1609/aaai.v40i40.40687

Issue

Vol. 40 No. 40 (2026)

Section

AAAI Technical Track on Natural Language Processing V