CR³: Boosting Compositional Reasoning in MLLMs Through Rule-Based Reinforcement Learning

Shun Qian; Bingquan Liu; Chengjie Sun; Peijin Xie; Zhen Xu; Baoxun Wang

doi:10.1609/aaai.v40i29.39680

Authors

Shun Qian, Faculty of Computing, Harbin Institute of Technology
Bingquan Liu, Faculty of Computing, Harbin Institute of Technology
Chengjie Sun, Faculty of Computing, Harbin Institute of Technology
Peijin Xie, Faculty of Computing, Harbin Institute of Technology
Zhen Xu, Platform and Content Group, Tencent
Baoxun Wang, Platform and Content Group, Tencent

Abstract

Compositional reasoning is a critical capability for multimodal models, enabling systematic understanding of complex scenes through structured combinations of objects, attributes, and relations. However, existing research on this ability primarily focuses on vision-language models (VLMs, e.g., CLIP and SigLIP), with limited exploration of multimodal large language models (MLLMs). To address this gap, we introduce CR³, a novel framework that enhances compositional reasoning abilities of MLLMs via rule-based reinforcement learning. CR³ leverages rule-based rewards to optimize the MLLM's policy on systematically curated multimodal instruction-following tasks, guided by a model-adaptive dynamic task mixing strategy. Our approach boosts performance by over 19% on three compositional reasoning benchmarks, significantly outperforming supervised fine-tuning (SFT) by at least 12%. Crucially, CR³ demonstrates superior generalization by improving performance on out-of-domain benchmarks where SFT methods degrade, highlighting its effectiveness and data efficiency.
