TAPO: Dynamic Teacher and Perturbed Answer Injection for Policy Optimization

Authors

  • Maowei Jiang, Tsinghua University, Shenzhen International Graduate School, China
  • Zihang Wang, University of Chinese Academy of Sciences, Beijing, China
  • Qi Wang, Tsinghua University, Shenzhen International Graduate School, China
  • Peter Búš, Tsinghua University, Shenzhen International Graduate School, China
  • Moquan Cheng, China Mobile Communications Corporation, China
  • Yifan Wang, The Chinese University of Hong Kong, Shenzhen
  • Quangao Liu, University of Chinese Academy of Sciences, Beijing, China
  • Ruiqi Li, University of Chinese Academy of Sciences, Beijing, China
  • Pengyu Zeng, Tsinghua University, Shenzhen International Graduate School, China
  • Ruikai Liu, University of Chinese Academy of Sciences, Beijing, China
  • Alan Liang, University of Chinese Academy of Sciences, Beijing, China
  • Yansong Xu, University of Chinese Academy of Sciences, Beijing, China
  • Yusong Hu, Tsinghua University, Shenzhen International Graduate School, China
  • Chaoran Zhang, Tsinghua University, Shenzhen International Graduate School, China
  • Zhiyong Dong, Tsinghua University, Shenzhen International Graduate School, China

DOI

https://doi.org/10.1609/aaai.v40i44.41079

Abstract

Reinforcement learning (RL) has emerged as a powerful framework for improving the reasoning performance of large language models (LLMs), with approaches such as Group Relative Policy Optimization (GRPO) showing promising results. However, GRPO and its variants struggle with collapsed groups (i.e., all-correct or all-incorrect completions), which yield zero-variance rewards and therefore no effective gradient signal. Moreover, rewarding only final-answer correctness while ignoring the reasoning process, together with rigid length penalties, can hinder training stability and output quality. To address these issues, we introduce TAPO, a reinforcement learning framework that enhances optimization signals by modifying sampled completions within training groups. TAPO incorporates three core techniques: (1) Dynamic Teacher Injection (DTI), which selectively injects high-quality or adversarial examples to restore effective gradient signals in collapsed groups; (2) Perturbed Answer Injection (PAI), which constructs partially correct completions, providing contrastive supervision that separates trajectories with sound reasoning but a wrong final answer; and (3) InfoLen-Aware Reward Shaping, a fine-grained reward strategy that penalizes outputs based on both length and semantic redundancy, encouraging concise yet informative responses. Extensive experiments demonstrate that TAPO significantly improves the mathematical reasoning capabilities of LLMs across multiple challenging benchmarks, outperforming the GRPO baseline by a substantial margin. Component-wise ablations further validate the contribution of each proposed technique.
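To make the group-collapse problem and the two injection techniques concrete, here is a minimal Python sketch. It assumes a binary correctness reward over sampled completions; the helper names (dynamic_teacher_injection, perturbed_answer_injection) and the specific perturbation are hypothetical illustrations based on the abstract, not the authors' implementation.

```python
import statistics
from typing import List, Optional, Tuple

Completion = Tuple[str, float]  # (generated text, binary correctness reward)

def group_advantages(rewards: List[float]) -> Optional[List[float]]:
    """GRPO-style group-relative advantages: (r - mean) / std.
    Returns None for a collapsed group (zero-variance rewards),
    where no useful gradient signal exists."""
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return None
    mean = statistics.fmean(rewards)
    return [(r - mean) / std for r in rewards]

def dynamic_teacher_injection(group: List[Completion],
                              teacher: Completion,
                              adversarial: Completion) -> List[Completion]:
    """DTI (sketch): if the sampled group collapsed, swap one member
    for a completion with the opposite reward, restoring reward
    variance and hence a usable optimization signal."""
    rewards = [r for _, r in group]
    if all(r == 0.0 for r in rewards):      # all-incorrect group
        return group[:-1] + [teacher]       # inject a high-quality exemplar
    if all(r == 1.0 for r in rewards):      # all-correct group
        return group[:-1] + [adversarial]   # inject a hard negative
    return group                            # healthy group: leave untouched

def perturbed_answer_injection(correct: Completion) -> Completion:
    """PAI (sketch): corrupt only the final answer of a correct
    completion, yielding a 'sound reasoning, wrong answer' sample
    that contrasts reasoning quality against answer correctness."""
    text, _ = correct
    # Hypothetical perturbation: flip the final answer token.
    perturbed = text.replace("Answer: 42", "Answer: 43")
    return (perturbed, 0.0)

if __name__ == "__main__":
    collapsed = [("... Answer: 41", 0.0)] * 4
    assert group_advantages([r for _, r in collapsed]) is None  # no signal
    repaired = dynamic_teacher_injection(
        collapsed,
        teacher=("... Answer: 42", 1.0),
        adversarial=perturbed_answer_injection(("... Answer: 42", 1.0)))
    print(group_advantages([r for _, r in repaired]))  # non-degenerate now
```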
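Similarly, InfoLen-Aware Reward Shaping can be pictured as a correctness reward minus penalties for excess length and repeated content. The sketch below assumes linear penalties and an n-gram repetition rate as a crude proxy for semantic redundancy; the paper's actual shaping function and coefficients are not reproduced here.

```python
def ngram_redundancy(text: str, n: int = 3) -> float:
    """Fraction of repeated word n-grams; an assumed stand-in for the
    paper's semantic-redundancy measure."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def infolen_reward(correct: bool, text: str,
                   target_len: int = 512,
                   alpha: float = 0.2, beta: float = 0.5) -> float:
    """Shaped reward (sketch): correctness minus penalties that grow
    with length beyond a target budget and with redundancy, so concise
    yet informative completions score highest."""
    base = 1.0 if correct else 0.0
    over = max(0, len(text.split()) - target_len) / target_len
    return base - alpha * over - beta * ngram_redundancy(text)
```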

Published

2026-03-14

How to Cite

Jiang, M., Wang, Z., Wang, Q., Búš, P., Cheng, M., Wang, Y., Liu, Q., Li, R., Zeng, P., Liu, R., Liang, A., Xu, Y., Hu, Y., Zhang, C., & Dong, Z. (2026). TAPO: Dynamic Teacher and Perturbed Answer Injection for Policy Optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 40(44), 37462-37471. https://doi.org/10.1609/aaai.v40i44.41079

Section

AAAI Special Track on AI Alignment