RGMP: Recurrent Geometric-prior Multimodal Policy for Generalizable Humanoid Robot Manipulation

Authors

  • Xuetao Li, Wuhan University
  • Wenke Huang, Wuhan University
  • Nengyuan Pan, Hubei University
  • Kaiyan Zhao, Wuhan University
  • Songhua Yang, Wuhan University
  • Yiming Wang, University of Macau
  • Mengde Li, Wuhan University
  • Mang Ye, Wuhan University
  • Jifeng Xuan, Wuhan University
  • Miao Li, Wuhan University

DOI:

https://doi.org/10.1609/aaai.v40i18.38539

Abstract

Humanoid robots exhibit significant potential for executing diverse human-level skills. However, current research predominantly relies on data-driven approaches that require extensive training datasets to achieve robust multimodal decision-making and generalizable visuomotor control. These methods are problematic because they neglect geometric reasoning in unseen scenarios and model robot-target relationships inefficiently within the training data, wasting substantial training resources. To address these limitations, we present the Recurrent Geometric-prior Multimodal Policy (RGMP), an end-to-end framework that unifies geometric-semantic skill reasoning with data-efficient visuomotor control. For perception, we propose the Geometric-prior Skill Selector, which infuses geometric inductive biases into a vision-language model, producing adaptive skill sequences for unseen scenes with minimal spatial common-sense tuning. To achieve data-efficient robotic motion synthesis, we introduce the Adaptive Recursive Gaussian Network, which parameterizes robot-object interactions as a compact hierarchy of Gaussian processes that recursively encode multi-scale spatial relationships, yielding dexterous, data-efficient motion synthesis even from sparse demonstrations. Evaluated on both our humanoid robot and a desktop robot, the RGMP framework achieves 87% task success in generalization tests and exhibits 5× greater data efficiency than the state-of-the-art model. This performance underscores its superior cross-domain generalization, paving the way for more versatile and data-efficient robotic systems.

Published

2026-03-14

How to Cite

Li, X., Huang, W., Pan, N., Zhao, K., Yang, S., Wang, Y., … Li, M. (2026). RGMP: Recurrent Geometric-prior Multimodal Policy for Generalizable Humanoid Robot Manipulation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(18), 15153–15161. https://doi.org/10.1609/aaai.v40i18.38539

Section

AAAI Technical Track on Data Mining & Knowledge Management II