MEML-GRPO: Heterogeneous Multi-Expert Mutual Learning for RLVR Advancement

Weitao Jia; Jinghui Lu; Haiyang Yu; Siqi Wang; Guozhi Tang; An-Lan Wang; Weijie Yin; Dingkang Yang; Yuxiang Nie; Bin Shan; Hao Feng; Irene Li; Kun Yang; Han Wang; Jingqun Tang; Teng Fu; Changhong Jin; Chao Feng; Xiaohui Lv; Can Huang

doi:10.1609/aaai.v40i37.40391

Authors

Weitao Jia, ByteDance Inc.
Jinghui Lu, ByteDance Inc.
Haiyang Yu, ByteDance Inc.; Fudan University
Siqi Wang, ByteDance Inc.
Guozhi Tang, ByteDance Inc.
An-Lan Wang, ByteDance Inc.
Weijie Yin, ByteDance Inc.
Dingkang Yang, ByteDance Inc.
Yuxiang Nie, ByteDance Inc.
Bin Shan, ByteDance Inc.
Hao Feng, ByteDance Inc.
Irene Li, University of Tokyo
Kun Yang, Fudan University
Han Wang, ByteDance Inc.
Jingqun Tang, ByteDance Inc.
Teng Fu, Fudan University
Changhong Jin, University College Dublin
Chao Feng, ByteDance Inc.
Xiaohui Lv, ByteDance Inc.
Can Huang, ByteDance Inc.

Abstract

Recent advances demonstrate that reinforcement learning with verifiable rewards (RLVR) significantly enhances the reasoning capabilities of large language models (LLMs). However, standard RLVR suffers from reward sparsity: when every candidate answer is incorrect, the resulting zero rewards provide no learning signal, particularly on challenging tasks. To address this, we propose Multi-Expert Mutual Learning GRPO (MEML-GRPO), a framework that uses diverse expert prompts as system prompts to generate a broader range of responses, substantially increasing the likelihood of identifying correct solutions. Additionally, we introduce an inter-expert mutual learning mechanism that facilitates knowledge sharing and transfer among experts, further boosting the model’s performance through RLVR. Extensive experiments across multiple reasoning benchmarks show that MEML-GRPO delivers significant improvements, achieving an average performance gain of 4.89% with Qwen and 11.33% with Llama, effectively overcoming the core limitations of traditional RLVR methods.
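The reward-sparsity problem described in the abstract can be illustrated with a minimal sketch. It assumes the commonly used GRPO-style group-relative advantage, (r - mean) / (std + eps), and binary verifiable rewards. The function name `group_advantages`, the reward values, and the idea of pooling responses from several expert prompts into one group are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - group mean) / (group std + eps).

    Assumed GRPO-style normalization; the paper's exact formula may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled answers, all incorrect: every reward is 0.
single_prompt_rewards = [0.0, 0.0, 0.0, 0.0]

# Hypothetical group sampled under different expert system prompts,
# where one expert happens to produce a correct answer.
multi_expert_rewards = [0.0, 0.0, 1.0, 0.0]

sparse = group_advantages(single_prompt_rewards)
pooled = group_advantages(multi_expert_rewards)

print(sparse)  # all advantages are zero, so there is no gradient signal
print(pooled)  # the correct answer is favored, the others are penalized
```

When all rewards in a group are identical, the normalized advantages vanish and the policy update carries no information. A single correct response in the group is enough to restore a non-zero signal, which is the motivation the abstract gives for sampling under diverse expert prompts.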
