Time-Frequency Token Advantage Clipping for Training Efficient Large Reasoning Model

Authors

  • Rong Bao College of Computer Science and Artificial Intelligence, Fudan University, China
  • Bo Wang College of Computer Science and Artificial Intelligence, Fudan University, China
  • Xiao Wang College of Computer Science and Artificial Intelligence, Fudan University, China
  • Hongyu Li Independent Researcher
  • Rui Zheng College of Computer Science and Artificial Intelligence, Fudan University, China
  • Leszek Rutkowski Systems Research Institute of the Polish Academy of Sciences, AGH University of Krakow, 30-059 Kraków and the SAN University, 90-113, Łódź, Poland
  • Qi Zhang College of Computer Science and Artificial Intelligence, Fudan University, China
  • Liang Ding The University of Sydney, Sydney, Australia
  • Dacheng Tao Generative AI Lab, College of Computing and Data Science, Nanyang Technological University, Singapore

DOI:

https://doi.org/10.1609/aaai.v40i36.40253

Abstract

Long Chain-of-Thought (CoT) reasoning enhances large reasoning models' performance but suffers from severe inefficiencies, as models often overthink simple problems or underthink complex ones. Current sequence-level optimizations, such as length penalties, are too coarse-grained to distinguish core logic from verbose language, precluding the token-level control necessary for efficient CoT reasoning. To overcome these limitations, we introduce Time-Frequency token Advantage Clipping (TFAC), a novel training framework designed to build efficient large reasoning models via token-level interventions. Specifically, TFAC operates along two dimensions: 1) The Frequency Dimension: It discourages inefficient loops and encourages deeper exploration by dynamically reducing the advantage scores of high-entropy tokens that are repeatedly generated within a single reasoning path. 2) The Time Dimension: It curbs the model's excessive overthinking by establishing a historical baseline for the occurrence count of each critical token in previously successful trajectories, and clipping the advantages of tokens that exceed this baseline during training. Crucially, to preserve the model's exploratory capabilities on novel problems, this suppression mechanism is automatically disabled when no historical record of success is available. Experiments conducted on the Deepseek-Distill-32B and Qwen3-8B models show that TFAC outperforms leading baseline methods, improving performance by 2.3 and 3.1 percentage points, respectively, while simultaneously reducing inference costs by 35% and 28% in scenarios where correct answers are generated. These results validate the significant efficacy of TFAC in training large reasoning models that are both powerful and highly efficient.
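The two clipping dimensions described in the abstract can be illustrated with a minimal sketch. Note that the function name, thresholds (`entropy_thresh`, `freq_decay`), and the clip-to-zero rule are assumptions for illustration, not the authors' actual implementation:

```python
from collections import Counter

def tfac_clip_advantages(
    tokens,              # token ids of one sampled reasoning path
    advantages,          # per-token advantage scores from the RL objective
    entropies,           # per-token predictive entropy
    history_counts,      # baseline count per token from past successful paths
    entropy_thresh=2.0,  # assumed threshold for a "high-entropy" token
    freq_decay=0.5,      # assumed decay factor per within-path repeat
):
    """Sketch of TFAC-style token advantage clipping (illustrative only)."""
    seen = Counter()
    clipped = []
    for tok, adv, ent in zip(tokens, advantages, entropies):
        seen[tok] += 1
        # Frequency dimension: dampen the advantage of high-entropy tokens
        # that repeat within a single reasoning path (discourages loops).
        if ent > entropy_thresh and seen[tok] > 1:
            adv *= freq_decay ** (seen[tok] - 1)
        # Time dimension: clip tokens whose count exceeds the historical
        # baseline from previously successful trajectories. Tokens with
        # no history (novel problems) are left untouched, preserving
        # exploration as the abstract describes.
        baseline = history_counts.get(tok)
        if baseline is not None and seen[tok] > baseline:
            adv = min(adv, 0.0)
        clipped.append(adv)
    return clipped
```

For example, a high-entropy token repeated three times against a historical baseline of two would keep its first advantage intact, have its second decayed, and have its third clipped, while a token never seen in any successful trajectory passes through unchanged.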

Published

2026-03-14

How to Cite

Bao, R., Wang, B., Wang, X., Li, H., Zheng, R., Rutkowski, L., … Tao, D. (2026). Time-Frequency Token Advantage Clipping for Training Efficient Large Reasoning Model. Proceedings of the AAAI Conference on Artificial Intelligence, 40(36), 30049–30057. https://doi.org/10.1609/aaai.v40i36.40253

Section

AAAI Technical Track on Natural Language Processing I