MedGR2: Breaking the Data Barrier for Medical Reasoning via Generative Reward Learning

Authors

  • Weihai Zhi, Guangdong Institute of Intelligence Science and Technology, Zhuhai, China
  • Jiayan Guo, School of Intelligence Science and Technology, Peking University, Beijing, China
  • Shangyang Li, Guangdong Institute of Intelligence Science and Technology, Zhuhai, China

DOI:

https://doi.org/10.1609/aaai.v40i34.40125

Abstract

The application of vision-language models in medicine is critically hampered by the scarcity of high-quality, expert-annotated data. Supervised fine-tuning on existing datasets often leads to poor generalization on unseen modalities and tasks, while reinforcement learning, a promising alternative, is stymied by the lack of reliable reward signals in this data-scarce domain. To address this challenge, we propose a Generative Reward Learning framework that establishes a self-improving training cycle. The framework jointly develops a data generator and a reward model, enabling the automated and continuous creation of high-quality multimodal medical data that serves as an effective training source for post-training. Our experiments demonstrate that supervised fine-tuning using the generated data already surpasses models trained on large-scale human-curated datasets. More importantly, when the generated data is further leveraged for reinforcement learning via Group Relative Policy Optimization, the resulting model achieves state-of-the-art cross-modality and cross-task generalization, significantly outperforming specialized reinforcement-learning-based methods. Notably, a compact model trained under this framework attains performance competitive with foundation models containing more than an order of magnitude more parameters. These results suggest a new paradigm for data-efficient learning in high-stakes medical domains, shifting the bottleneck from data scarcity to data generation and unlocking the potential of reinforcement learning for building robust and generalizable medical AI systems.
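
To illustrate the group-relative signal that Group Relative Policy Optimization is generally built on, here is a minimal sketch. It is not the authors' implementation. It assumes that several candidate answers are sampled for one question, that a reward model assigns each a scalar score, and that each score is standardized against the mean and standard deviation of its own group. The function name, the example scores, the `eps` constant, and the use of the population standard deviation are all illustrative assumptions; none of them is taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the other samples in its group.

    Assumption: population standard deviation; some implementations use
    the sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when every reward is identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four sampled answers to one question.
scores = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(scores)
```

Answers scoring above the group mean receive a positive advantage and those below receive a negative one, so no separately trained value function is required. When all samples in a group receive the same reward, every advantage is zero and that group contributes no learning signal.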

Published

2026-03-14

How to Cite

Zhi, W., Guo, J., & Li, S. (2026). MedGR2: Breaking the Data Barrier for Medical Reasoning via Generative Reward Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(34), 28901–28909. https://doi.org/10.1609/aaai.v40i34.40125

Section

AAAI Technical Track on Machine Learning XI