Time-Frequency Token Advantage Clipping for Training Efficient Large Reasoning Model

Authors

  • Rong Bao College of Computer Science and Artificial Intelligence, Fudan University, China
  • Bo Wang College of Computer Science and Artificial Intelligence, Fudan University, China
  • Xiao Wang College of Computer Science and Artificial Intelligence, Fudan University, China
  • Hongyu Li Independent Researcher
  • Rui Zheng College of Computer Science and Artificial Intelligence, Fudan University, China
  • Leszek Rutkowski Systems Research Institute of the Polish Academy of Sciences, AGH University of Krakow, 30-059 Kraków and the SAN University, 90-113, Łódź, Poland
  • Qi Zhang College of Computer Science and Artificial Intelligence, Fudan University, China
  • Liang Ding The University of Sydney, Sydney, Australia
  • Dacheng Tao Generative AI Lab, College of Computing and Data Science, Nanyang Technological University, Singapore

DOI:

https://doi.org/10.1609/aaai.v40i36.40253

Abstract

Long Chain-of-Thought (CoT) reasoning enhances large reasoning models' performance but suffers from severe inefficiencies, as models often overthink simple problems or underthink complex ones. Current sequence-level optimizations, such as length penalties, are too coarse-grained to distinguish core logic from verbose language, precluding the token-level control necessary for efficient CoT reasoning. To overcome these limitations, we introduce Time-Frequency token Advantage Clipping (TFAC), a novel training framework designed to build efficient large reasoning models via token-level interventions. Specifically, TFAC operates along two dimensions: 1) The Frequency Dimension: It discourages inefficient loops and encourages deeper exploration by dynamically reducing the advantage scores of high-entropy tokens that are repeatedly generated within a single reasoning path. 2) The Time Dimension: It curbs the model's excessive overthinking by establishing a historical baseline for the occurrence count of each critical token in previously successful trajectories, and clipping the advantages of tokens that exceed this baseline during training. Crucially, to preserve the model's exploratory capabilities on novel problems, this suppression mechanism is automatically disabled when no historical record of success is available. Experiments conducted on the Deepseek-Distill-32B and Qwen3-8B models show that TFAC outperforms leading baseline methods, improving performance by 2.3 and 3.1 percentage points, respectively, while simultaneously reducing inference costs by 35% and 28% in scenarios where correct answers are generated. These results validate the significant efficacy of TFAC in training large reasoning models that are both powerful and highly efficient.
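The two clipping dimensions described in the abstract can be illustrated with a minimal sketch. Note that the function name, thresholds (`entropy_thresh`, `freq_decay`), and the clip-to-zero rule are assumptions for illustration, not the authors' actual implementation:

```python
from collections import Counter

def tfac_clip_advantages(
    tokens,              # token ids of one sampled reasoning path
    advantages,          # per-token advantage scores from the RL objective
    entropies,           # per-token predictive entropy
    history_counts,      # baseline count per token from past successful paths
    entropy_thresh=2.0,  # assumed threshold for a "high-entropy" token
    freq_decay=0.5,      # assumed decay factor per within-path repeat
):
    """Sketch of TFAC-style token advantage clipping (illustrative only)."""
    seen = Counter()
    clipped = []
    for tok, adv, ent in zip(tokens, advantages, entropies):
        seen[tok] += 1
        # Frequency dimension: dampen the advantage of high-entropy tokens
        # that repeat within a single reasoning path (discourages loops).
        if ent > entropy_thresh and seen[tok] > 1:
            adv *= freq_decay ** (seen[tok] - 1)
        # Time dimension: clip tokens whose count exceeds the historical
        # baseline from previously successful trajectories. Tokens with
        # no history (novel problems) are left untouched, preserving
        # exploration as the abstract describes.
        baseline = history_counts.get(tok)
        if baseline is not None and seen[tok] > baseline:
            adv = min(adv, 0.0)
        clipped.append(adv)
    return clipped
```

For example, a high-entropy token repeated three times against a historical baseline of two would keep its first advantage intact, have its second decayed, and have its third clipped, while a token never seen in any successful trajectory passes through unchanged.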

Published

2026-03-14

How to Cite

Bao, R., Wang, B., Wang, X., Li, H., Zheng, R., Rutkowski, L., … Tao, D. (2026). Time-Frequency Token Advantage Clipping for Training Efficient Large Reasoning Model. Proceedings of the AAAI Conference on Artificial Intelligence, 40(36), 30049–30057. https://doi.org/10.1609/aaai.v40i36.40253

Section

AAAI Technical Track on Natural Language Processing I