Affordance-R1: Reinforcement Learning for Generalizable Affordance Reasoning in Multimodal Large Language Models
DOI:
https://doi.org/10.1609/aaai.v40i12.37937
Abstract
Affordance grounding focuses on predicting the specific regions of objects that are associated with the actions to be performed by robots. It plays a vital role in human-robot interaction, human-object interaction, embodied manipulation, and embodied perception. Existing models often neglect the affordances shared among different objects because they lack Chain-of-Thought (CoT) reasoning abilities, which limits their out-of-domain generalization and explicit reasoning capabilities. To address these challenges, we propose Affordance-R1, the first unified affordance grounding framework that integrates cognitive, CoT-guided Group Relative Policy Optimization (GRPO) within a reinforcement learning paradigm. Specifically, we design a sophisticated affordance reward function comprising format, perception, and cognition rewards to effectively guide optimization. Furthermore, we construct a high-quality affordance-centric reasoning dataset, ReasonAff, to support training. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Affordance-R1 achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Comprehensive experiments demonstrate that our model outperforms well-established methods and exhibits open-world generalization.
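The abstract describes a composite reward (format, perception, and cognition terms) optimized with GRPO. A minimal sketch of these two ingredients is shown below; the function names, weights, and the simple weighted-sum combination are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: composite reward plus GRPO-style group-relative
# advantage. Weights and signatures are assumed for illustration only.

def combined_reward(fmt, perception, cognition, w=(1.0, 1.0, 1.0)):
    """Weighted sum of the format, perception, and cognition rewards."""
    return w[0] * fmt + w[1] * perception + w[2] * cognition

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO normalizes each sampled response's reward against its group:
    advantage = (reward - group mean) / (group std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

In GRPO, several responses are sampled per prompt; each response's scalar reward is converted into a group-relative advantage, so no learned value function is needed.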
Published
2026-03-14
How to Cite
Wang, H., Wang, S., Zhong, Y., Yang, Z., Wang, J., Cui, Z., Yuan, J., Han, Y., Liu, M., & Ma, Y. (2026). Affordance-R1: Reinforcement Learning for Generalizable Affordance Reasoning in Multimodal Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(12), 9738-9746. https://doi.org/10.1609/aaai.v40i12.37937
Section
AAAI Technical Track on Computer Vision IX