TAPO: Dynamic Teacher and Perturbed Answer Injection for Policy Optimization
DOI:
https://doi.org/10.1609/aaai.v40i44.41079

Abstract
Reinforcement learning (RL) has emerged as a powerful framework for improving the reasoning performance of large language models (LLMs), with approaches such as Group Relative Policy Optimization (GRPO) showing promising results. However, GRPO and its variants struggle with collapsed groups (i.e., all-correct or all-incorrect completions), which yield zero-variance rewards and ineffective gradient signals. Moreover, focusing solely on final-answer correctness while ignoring the reasoning process, along with rigid length penalties, can hinder training stability and output quality. To address these issues, we introduce TAPO, a reinforcement learning framework that enhances optimization signals by modifying sampled completions within training groups. TAPO incorporates three core techniques: (1) Dynamic Teacher Injection (DTI), which selectively injects high-quality or adversarial examples to restore effective gradient signals in collapsed groups; (2) Perturbed Answer Injection (PAI), which constructs partially correct completions to provide contrastive supervision, separating trajectories with sound reasoning but a wrong final answer from fully correct ones; and (3) InfoLen-Aware Reward Shaping, a fine-grained reward strategy that penalizes outputs based on both length and semantic redundancy, encouraging concise yet informative responses. Extensive experimental results demonstrate that TAPO significantly improves the mathematical reasoning capabilities of LLMs across multiple challenging benchmarks, outperforming the GRPO baseline by a substantial margin. Component-wise ablations further validate the contribution of each proposed technique.
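To make the collapsed-group problem concrete, the sketch below shows GRPO-style group-relative advantages and why a zero-variance group produces no gradient signal. The `inject_teacher` helper is a hypothetical illustration of the DTI idea under simplifying assumptions (scalar 0/1 rewards, replacing one sampled completion's reward with a teacher's); the paper's actual injection mechanism may differ.

```python
import statistics

def group_advantages(rewards):
    """GRPO-style group-relative advantages: (r - mean) / std.
    A collapsed group (all-correct or all-incorrect) has zero variance,
    so every advantage -- and hence the policy gradient signal -- is zero."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:  # collapsed group: no learning signal
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

def inject_teacher(rewards, teacher_reward):
    """Hypothetical DTI-style repair: swap one sampled completion's reward
    for a teacher example's reward, restoring variance in a collapsed group."""
    if len(set(rewards)) == 1:  # zero-variance (collapsed) group
        rewards = rewards[:-1] + [teacher_reward]
    return rewards

# An all-incorrect group yields all-zero advantages...
print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # -> [0.0, 0.0, 0.0, 0.0]
# ...but injecting a correct teacher completion restores nonzero advantages.
print(group_advantages(inject_teacher([0.0, 0.0, 0.0, 0.0], 1.0)))
```

Because the advantages are mean-centered, the injected group's advantages sum to zero: the teacher completion is pushed up while the incorrect samples are pushed down.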
Published
2026-03-14
How to Cite
Jiang, M., Wang, Z., Wang, Q., Búš, P., Cheng, M., Wang, Y., Liu, Q., Li, R., Zeng, P., Liu, R., Liang, A., Xu, Y., Hu, Y., Zhang, C., & Dong, Z. (2026). TAPO: Dynamic Teacher and Perturbed Answer Injection for Policy Optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 40(44), 37462-37471. https://doi.org/10.1609/aaai.v40i44.41079
Issue
Section
AAAI Special Track on AI Alignment