TAPO: Dynamic Teacher and Perturbed Answer Injection for Policy Optimization

Authors

  • Maowei Jiang, Tsinghua University, Shenzhen International Graduate School, China
  • Zihang Wang, University of Chinese Academy of Sciences, Beijing, China
  • Qi Wang, Tsinghua University, Shenzhen International Graduate School, China
  • Peter Búš, Tsinghua University, Shenzhen International Graduate School, China
  • Moquan Cheng, China Mobile Communications Corporation, China
  • Yifan Wang, The Chinese University of Hong Kong, Shenzhen
  • Quangao Liu, University of Chinese Academy of Sciences, Beijing, China
  • Ruiqi Li, University of Chinese Academy of Sciences, Beijing, China
  • Pengyu Zeng, Tsinghua University, Shenzhen International Graduate School, China
  • Ruikai Liu, University of Chinese Academy of Sciences, Beijing, China
  • Alan Liang, University of Chinese Academy of Sciences, Beijing, China
  • Yansong Xu, University of Chinese Academy of Sciences, Beijing, China
  • Yusong Hu, Tsinghua University, Shenzhen International Graduate School, China
  • Chaoran Zhang, Tsinghua University, Shenzhen International Graduate School, China
  • Zhiyong Dong, Tsinghua University, Shenzhen International Graduate School, China

DOI

https://doi.org/10.1609/aaai.v40i44.41079

Abstract

Reinforcement learning (RL) has emerged as a powerful framework for improving the reasoning performance of large language models (LLMs), with approaches such as Group Relative Policy Optimization (GRPO) showing promising results. However, GRPO and its variants struggle with collapsed groups (i.e., all-correct or all-incorrect completions), which yield zero-variance rewards and therefore no effective gradient signal. Moreover, rewarding only final-answer correctness while ignoring the reasoning process, together with rigid length penalties, can hinder training stability and output quality. To address these issues, we introduce TAPO, a reinforcement learning framework that enhances optimization signals by modifying sampled completions within training groups. TAPO incorporates three core techniques: (1) Dynamic Teacher Injection (DTI), which selectively injects high-quality or adversarial examples to restore effective gradient signals in collapsed groups; (2) Perturbed Answer Injection (PAI), which constructs partially correct completions, providing contrastive supervision that separates trajectories with sound reasoning but a wrong final answer; and (3) InfoLen-Aware Reward Shaping, a fine-grained reward strategy that penalizes outputs based on both length and semantic redundancy, encouraging concise yet informative responses. Extensive experiments demonstrate that TAPO significantly improves the mathematical reasoning capabilities of LLMs across multiple challenging benchmarks, outperforming the GRPO baseline by a substantial margin. Component-wise ablations further validate the contribution of each proposed technique.
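To make the group-collapse problem and the two injection techniques concrete, here is a minimal Python sketch. It assumes a binary correctness reward over sampled completions; the helper names (dynamic_teacher_injection, perturbed_answer_injection) and the specific perturbation are hypothetical illustrations based on the abstract, not the authors' implementation.

```python
import statistics
from typing import List, Optional, Tuple

Completion = Tuple[str, float]  # (generated text, binary correctness reward)

def group_advantages(rewards: List[float]) -> Optional[List[float]]:
    """GRPO-style group-relative advantages: (r - mean) / std.
    Returns None for a collapsed group (zero-variance rewards),
    where no useful gradient signal exists."""
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return None
    mean = statistics.fmean(rewards)
    return [(r - mean) / std for r in rewards]

def dynamic_teacher_injection(group: List[Completion],
                              teacher: Completion,
                              adversarial: Completion) -> List[Completion]:
    """DTI (sketch): if the sampled group collapsed, swap one member
    for a completion with the opposite reward, restoring reward
    variance and hence a usable optimization signal."""
    rewards = [r for _, r in group]
    if all(r == 0.0 for r in rewards):      # all-incorrect group
        return group[:-1] + [teacher]       # inject a high-quality exemplar
    if all(r == 1.0 for r in rewards):      # all-correct group
        return group[:-1] + [adversarial]   # inject a hard negative
    return group                            # healthy group: leave untouched

def perturbed_answer_injection(correct: Completion) -> Completion:
    """PAI (sketch): corrupt only the final answer of a correct
    completion, yielding a 'sound reasoning, wrong answer' sample
    that contrasts reasoning quality against answer correctness."""
    text, _ = correct
    # Hypothetical perturbation: flip the final answer token.
    perturbed = text.replace("Answer: 42", "Answer: 43")
    return (perturbed, 0.0)

if __name__ == "__main__":
    collapsed = [("... Answer: 41", 0.0)] * 4
    assert group_advantages([r for _, r in collapsed]) is None  # no signal
    repaired = dynamic_teacher_injection(
        collapsed,
        teacher=("... Answer: 42", 1.0),
        adversarial=perturbed_answer_injection(("... Answer: 42", 1.0)))
    print(group_advantages([r for _, r in repaired]))  # non-degenerate now
```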
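Similarly, InfoLen-Aware Reward Shaping can be pictured as a correctness reward minus penalties for excess length and repeated content. The sketch below assumes linear penalties and an n-gram repetition rate as a crude proxy for semantic redundancy; the paper's actual shaping function and coefficients are not reproduced here.

```python
def ngram_redundancy(text: str, n: int = 3) -> float:
    """Fraction of repeated word n-grams; an assumed stand-in for the
    paper's semantic-redundancy measure."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def infolen_reward(correct: bool, text: str,
                   target_len: int = 512,
                   alpha: float = 0.2, beta: float = 0.5) -> float:
    """Shaped reward (sketch): correctness minus penalties that grow
    with length beyond a target budget and with redundancy, so concise
    yet informative completions score highest."""
    base = 1.0 if correct else 0.0
    over = max(0, len(text.split()) - target_len) / target_len
    return base - alpha * over - beta * ngram_redundancy(text)
```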

Published

2026-03-14

How to Cite

Jiang, M., Wang, Z., Wang, Q., Búš, P., Cheng, M., Wang, Y., Liu, Q., Li, R., Zeng, P., Liu, R., Liang, A., Xu, Y., Hu, Y., Zhang, C., & Dong, Z. (2026). TAPO: Dynamic Teacher and Perturbed Answer Injection for Policy Optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 40(44), 37462-37471. https://doi.org/10.1609/aaai.v40i44.41079

Section

AAAI Special Track on AI Alignment