PSPO: Prompt-Level Prioritization and Experience-Weighted Smoothing for Efficient Policy Optimization

Authors

  • Xinxin Zhu College of Computer Science and Software Engineering, Shenzhen University, China; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China
  • Ying He College of Computer Science and Software Engineering, Shenzhen University, China
  • Haowen Hou Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China
  • Ruichong Zhang Tsinghua University, China
  • Nianbo Zeng College of Computer Science and Software Engineering, Shenzhen University, China; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China
  • Yulin Peng College of Computer Science and Software Engineering, Shenzhen University, China
  • Jiongfeng Fang Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China
  • F. Richard Yu School of Information Technology, Carleton University, Canada

DOI:

https://doi.org/10.1609/aaai.v40i34.40157

Abstract

Reinforcement Fine-tuning (RFT) methods such as Group Relative Policy Optimization (GRPO) have demonstrated strong capabilities in aligning Large Language Models with human preferences. However, these approaches often suffer from limited data efficiency, necessitating extensive on-policy rollouts to maintain competitive performance. We propose PSPO (Prompt-Level Prioritization and Experience-Weighted Smoothing for Efficient Policy Optimization), a lightweight yet effective enhancement to GRPO that improves training stability and sample efficiency through two complementary techniques. First, we introduce an experience-weighted reward smoothing mechanism, which uses exponential moving averages to track group-level reward statistics for each prompt. This enables more stable advantage estimation across training steps without storing entire trajectories, allowing the model to capture historical reward trends in a lightweight and memory-efficient manner. Second, we adopt a prompt-level prioritized sampling strategy, an online data selection method inspired by prioritized experience replay. It dynamically emphasizes higher-impact prompts based on their relative advantages, thereby improving data efficiency. Experiments on multiple mathematical reasoning benchmarks and models show that PSPO achieves accuracy comparable to or better than GRPO while converging significantly faster and maintaining low computational and memory overhead.
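
The abstract describes but does not formalize the two mechanisms. As a rough illustration, the sketch below pairs a per-prompt exponential moving average of group mean rewards (used as a smoothed baseline for advantage estimation) with priority-proportional prompt sampling in the style of prioritized experience replay. All names (PSPOSketch, ema_beta, priority_alpha) and the exact update rules are assumptions for illustration, not the authors' implementation.

    import random
    from collections import defaultdict

    class PSPOSketch:
        """Minimal sketch of the two mechanisms named in the abstract.

        NOT the paper's implementation: the EMA update, the priority
        formula, and all hyperparameter names are assumptions.
        """

        def __init__(self, prompts, ema_beta=0.9, priority_alpha=0.6, eps=1e-6):
            self.prompts = list(prompts)
            self.ema_beta = ema_beta              # smoothing factor for reward statistics
            self.priority_alpha = priority_alpha  # how sharply priorities skew sampling
            self.eps = eps
            self.ema_reward = {}                  # per-prompt EMA of the group mean reward
            self.priority = defaultdict(lambda: 1.0)  # unseen prompts start at max priority

        def sample_prompt(self):
            # Prompt-level prioritized sampling: draw a prompt with probability
            # proportional to priority**alpha, as in prioritized experience replay.
            weights = [self.priority[p] ** self.priority_alpha for p in self.prompts]
            return random.choices(self.prompts, weights=weights, k=1)[0]

        def update(self, prompt, group_rewards):
            """Update smoothed statistics and priority after one group rollout."""
            mean_r = sum(group_rewards) / len(group_rewards)
            # Experience-weighted reward smoothing: keep a scalar EMA of the
            # group mean reward per prompt instead of storing trajectories.
            if prompt in self.ema_reward:
                self.ema_reward[prompt] = (self.ema_beta * self.ema_reward[prompt]
                                           + (1 - self.ema_beta) * mean_r)
            else:
                self.ema_reward[prompt] = mean_r
            # Advantages against the smoothed historical baseline rather than
            # only the current group's mean, for more stable estimates.
            advantages = [r - self.ema_reward[prompt] for r in group_rewards]
            # Higher-impact prompts (larger mean |advantage|) get sampled more often.
            self.priority[prompt] = (sum(abs(a) for a in advantages)
                                     / len(advantages) + self.eps)
            return advantages

    # Hypothetical usage inside a GRPO-style training loop:
    sampler = PSPOSketch(prompts=["q1", "q2", "q3"])
    prompt = sampler.sample_prompt()
    rewards = [0.0, 1.0, 1.0, 0.0]          # rewards for one group of rollouts
    advs = sampler.update(prompt, rewards)  # advantages for the policy update

Tracking only a scalar EMA per prompt, rather than replaying stored trajectories, is consistent with the abstract's claim of low memory overhead.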

Published

2026-03-14

How to Cite

Zhu, X., He, Y., Hou, H., Zhang, R., Zeng, N., Peng, Y., Fang, J., & Yu, F. R. (2026). PSPO: Prompt-Level Prioritization and Experience-Weighted Smoothing for Efficient Policy Optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 40(34), 29186–29194. https://doi.org/10.1609/aaai.v40i34.40157

Section

AAAI Technical Track on Machine Learning XI