RAGG: Retrieval-Augmented Grasp Generation Model

Authors

  • Zhenhua Tang Hefei University of Technology
  • Bin Zhu Singapore Management University
  • Yanbin Hao University of Science and Technology of China
  • Chong-Wah Ngo Singapore Management University
  • Richang Hong Hefei University of Technology

DOI:

https://doi.org/10.1609/aaai.v39i7.32786

Abstract

Intent-based grasp generation inherently involves challenges such as manipulation ambiguity and modality gaps. To address these, we propose a novel Retrieval-Augmented Grasp Generation model (RAGG). Our key insight is that when humans manipulate new objects, they initially mimic the interaction patterns observed in similar objects, then progressively adjust hand-object contact. Consequently, we develop RAGG as a two-stage approach, encompassing retrieval-guided generation and structurally stable grasp refinement. In the first stage, we propose a Retrieval-Augmented Diffusion Model (ReDim), which identifies the most relevant interaction instance from a knowledge base to explicitly guide grasp generation, thereby mitigating ambiguity and bridging modality gaps to ensure semantically correct manipulation. In the second stage, we introduce a Progressive Refinement Network (PRN) with Kolmogorov-Arnold Network (KAN) layers to refine the generated coarse grasp, employing a Structural Similarity Index loss to constrain the spatial relationship between the hand and the object, thus ensuring the stability of the grasp. Extensive experiments on the OakInk and GRAB benchmarks demonstrate that RAGG achieves superior results compared to state-of-the-art approaches, indicating not only better physical feasibility and controllability but also strong generalization and interpretability for unseen objects.
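
The retrieval step of the first stage can be illustrated with a minimal sketch. This is not the authors' ReDim implementation: the abstract does not specify how relevance is measured, so the sketch assumes each knowledge-base entry is a fixed-length embedding and that relevance is cosine similarity. The names `cosine_similarity`, `retrieve_most_relevant`, and the toy entries are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed relevance measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def retrieve_most_relevant(query, knowledge_base):
    """Return the (label, embedding) entry most similar to the query embedding."""
    if not knowledge_base:
        raise ValueError("knowledge base is empty")
    return max(knowledge_base, key=lambda entry: cosine_similarity(query, entry[1]))


# Toy knowledge base: hypothetical object/intent labels with made-up embeddings.
knowledge_base = [
    ("mug/drink", [0.9, 0.1, 0.0]),
    ("knife/cut", [0.0, 0.8, 0.6]),
    ("bottle/pour", [0.7, 0.0, 0.7]),
]

query = [1.0, 0.2, 0.1]  # embedding of a new object and intent (made up)
best_label, best_embedding = retrieve_most_relevant(query, knowledge_base)
print(best_label)
```

In a pipeline of the kind the abstract describes, the retrieved instance would then condition the generative model; that conditioning step is not sketched here because the abstract gives no details on it.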

Published

2025-04-11

How to Cite

Tang, Z., Zhu, B., Hao, Y., Ngo, C.-W., & Hong, R. (2025). RAGG: Retrieval-Augmented Grasp Generation Model. Proceedings of the AAAI Conference on Artificial Intelligence, 39(7), 7311-7319. https://doi.org/10.1609/aaai.v39i7.32786

Section

AAAI Technical Track on Computer Vision VI