Affordance-R1: Reinforcement Learning for Generalizable Affordance Reasoning in Multimodal Large Language Models
DOI:
https://doi.org/10.1609/aaai.v40i12.37937
Abstract
Affordance grounding focuses on predicting the specific regions of objects that are associated with the actions to be performed by robots. It plays a vital role in human-robot interaction, human-object interaction, embodied manipulation, and embodied perception. Existing models often neglect the affordances shared among different objects because they lack Chain-of-Thought (CoT) reasoning abilities, which limits their out-of-domain generalization and explicit reasoning capabilities. To address these challenges, we propose Affordance-R1, the first unified affordance grounding framework that integrates cognitive, CoT-guided Group Relative Policy Optimization (GRPO) within a reinforcement learning paradigm. Specifically, we design a sophisticated affordance reward function comprising format, perception, and cognition rewards to effectively guide optimization. Furthermore, we construct a high-quality affordance-centric reasoning dataset, ReasonAff, to support training. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Affordance-R1 achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Comprehensive experiments demonstrate that our model outperforms well-established methods and exhibits open-world generalization.
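The abstract describes a composite reward (format, perception, and cognition terms) optimized with GRPO. A minimal sketch of these two ingredients is shown below; the function names, weights, and the simple weighted-sum combination are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: composite reward plus GRPO-style group-relative
# advantage. Weights and signatures are assumed for illustration only.

def combined_reward(fmt, perception, cognition, w=(1.0, 1.0, 1.0)):
    """Weighted sum of the format, perception, and cognition rewards."""
    return w[0] * fmt + w[1] * perception + w[2] * cognition

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO normalizes each sampled response's reward against its group:
    advantage = (reward - group mean) / (group std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

In GRPO, several responses are sampled per prompt; each response's scalar reward is converted into a group-relative advantage, so no learned value function is needed.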
Published
2026-03-14
How to Cite
Wang, H., Wang, S., Zhong, Y., Yang, Z., Wang, J., Cui, Z., Yuan, J., Han, Y., Liu, M., & Ma, Y. (2026). Affordance-R1: Reinforcement Learning for Generalizable Affordance Reasoning in Multimodal Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(12), 9738-9746. https://doi.org/10.1609/aaai.v40i12.37937
Section
AAAI Technical Track on Computer Vision IX