ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models

Authors

  • Zirui Song, Mohamed bin Zayed University of Artificial Intelligence
  • Guangxian Ouyang, Mohamed bin Zayed University of Artificial Intelligence
  • Mingzhe Li, ByteDance
  • Yuheng Ji, Institute of Automation, Chinese Academy of Sciences
  • Chenxi Wang, Mohamed bin Zayed University of Artificial Intelligence
  • Zixiang Xu, Mohamed bin Zayed University of Artificial Intelligence
  • Zeyu Zhang, The Australian National University
  • Xiaoqing Zhang, Renmin University of China
  • Qian Jiang, Mohamed bin Zayed University of Artificial Intelligence
  • Fengxian Ji, Mohamed bin Zayed University of Artificial Intelligence
  • Zhenhao Chen, Mohamed bin Zayed University of Artificial Intelligence
  • Zhongzhi Li, Institute of Automation, Chinese Academy of Sciences
  • Xiuying Chen, Mohamed bin Zayed University of Artificial Intelligence

DOI:

https://doi.org/10.1609/aaai.v40i22.38922

Abstract

Large Vision-Language Models (LVLMs) have recently advanced robotic manipulation by leveraging vision for scene perception and language for instruction following. However, existing methods rely heavily on costly human-annotated training datasets, which limits their generalization and causes them to struggle in out-of-domain (OOD) scenarios, reducing real-world adaptability. To address these challenges, we propose ManipLVM-R1, a novel reinforcement learning framework that replaces traditional supervision with Reinforcement Learning using Verifiable Rewards (RLVR). By directly optimizing for task-aligned outcomes, our method enhances generalization and physical reasoning while removing the dependence on costly annotations. Specifically, we design two rule-based reward functions targeting key robotic manipulation subtasks: an Affordance Perception Reward to enhance localization of interaction regions, and a Trajectory Match Reward to ensure the physical plausibility of action paths. These rewards provide immediate feedback and impose spatial-logical constraints, encouraging the model to go beyond shallow pattern matching and instead learn deeper, more systematic reasoning about physical interactions. Experimental results show that ManipLVM-R1 achieves substantial performance gains across multiple manipulation tasks while using only 50% of the training data, and that it generalizes strongly to OOD scenarios. We further analyze the benefits of our reward design and its impact on task success and efficiency.
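
To make the idea of rule-based, verifiable rewards more concrete, the following is a minimal illustrative sketch, not the formulation used in the paper. The abstract does not specify how either reward is computed, so everything here is an assumption: the hypothetical function names `affordance_reward` and `trajectory_reward`, the use of box IoU for region localization, the use of mean pointwise Euclidean distance for path matching, and the `scale` parameter.

```python
import math


def affordance_reward(pred_box, gt_box):
    """Assumed stand-in for an affordance reward: IoU of two boxes.

    Boxes are (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
    """
    # Intersection rectangle, clamped to zero when boxes do not overlap.
    inter_w = max(0.0, min(pred_box[2], gt_box[2]) - max(pred_box[0], gt_box[0]))
    inter_h = max(0.0, min(pred_box[3], gt_box[3]) - max(pred_box[1], gt_box[1]))
    inter = inter_w * inter_h

    area_pred = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_gt = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


def trajectory_reward(pred_path, gt_path, scale=1.0):
    """Assumed stand-in for a trajectory reward: exp(-mean distance / scale).

    Paths are equal-length sequences of points; a length mismatch or an
    empty reference path yields zero reward.
    """
    if not gt_path or len(pred_path) != len(gt_path):
        return 0.0
    mean_dist = sum(math.dist(p, g) for p, g in zip(pred_path, gt_path)) / len(gt_path)
    return math.exp(-mean_dist / scale)


if __name__ == "__main__":
    print(affordance_reward((0, 0, 2, 2), (1, 1, 3, 3)))
    print(trajectory_reward([(0, 0), (1, 0)], [(0, 1), (1, 1)]))
```

Both functions return a value in [0, 1] that can be checked automatically against a reference, which is the property that makes a reward "verifiable" in this sense; the actual reward definitions should be taken from the paper itself.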

Published

2026-03-14

How to Cite

Song, Z., Ouyang, G., Li, M., Ji, Y., Wang, C., Xu, Z., Zhang, Z., Zhang, X., Jiang, Q., Ji, F., Chen, Z., Li, Z., & Chen, X. (2026). ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(22), 18558-18566. https://doi.org/10.1609/aaai.v40i22.38922

Section

AAAI Technical Track on Intelligent Robotics